- 29 August 94
- Version 1.21 of pccts
-
- This help file is available via anonymous FTP at:
-
- Node: everest.ee.umn.edu [128.101.144.112]
- File: /pub/pccts/1.22/NOTES.newbie
-
- Mirror sites for pccts:
-
- Europe:
-
- Node: ftp.th-darmstadt.de [130.83.55.75]
- Directory: pub/programming/languages/compiler-compiler/pccts
-
- According to the FAQ this is updated daily.
-
- Node: src.doc.ic.ac.uk
- Directory: /computing/programming/languages/tools/pccts
-
- Unable to reach this node from my site.
- Also:
-
- Node: ftp.uu.net
- Directory: languages/tools/pccts
-
- Mail corrections or additions to moog@polhode.com
- ===============================================================================
- Miscellaneous
- -------------------------------------------------------------------------------
- 1. NEVER choose rule names, #token names, #lexclass names, #errclass
- names, etc. which coincide with the reserved words of your C or C++
- compiler. Be awake to name collisions with your favorite libraries
- and #include files. One can only imagine the results of definitions like:
-
- #token FILE "file"
-
- const: "[0-9]*"
-
- 2. Tokens begin with uppercase characters. Rules begin with lowercase
- characters.
-
- 3. When passing the name of the start rule to the ANTLR macro don't
- forget to code the trailing function arguments:
-
- /* Not using ASTs */ ANTLR (grammar(),stdin);
- /* Using ASTs */ ANTLR (grammar(&ASTroot),stdin);
-
- /* *** Wrong *** */ ANTLR (grammar,stdin);
-
- 4. When you see a syntax error message that has quotation marks on
- separate lines:
-
- line 1: syntax error at "
- " missing ID
-
- that probably means that the offending element contains a newline.
-
-
- 5. Even if your C compiler does not support C++ style comments,
- you can use them in the *non-action* portion of the ANTLR source code.
- Inside an action (i.e. <<...>> ) you have to obey the comment
- conventions of your compiler.
-
- 6. To place main() in a ".c" file rather than a grammar file (".g")
- place:
-
- #include "stdpccts.h"
-
- before invoking the ANTLR macro. Contributed by N.F. Ross.
-
- 7. ANTLR counts a line which is continued across a newline using
- the backslash convention as a single line. For example:
-
- #header <<
- #define abcd alpha\
- beta\
- gamma\
- delta
- >>
-
- This will cause line numbers in ANTLR error messages to be off by 3 compared
- to most text editors.
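- The off-by-3 effect can be reproduced with a few lines of C. This is a
- counting sketch, not ANTLR's actual code; it treats a backslash-newline
- pair as a line continuation, the way ANTLR does:
-
```c
#include <assert.h>

/* Count physical lines (what a text editor shows) versus logical lines
 * (what ANTLR counts: a backslash-newline pair continues the current
 * line).  A sketch of the rule, not ANTLR's actual counter. */
static void count_lines(const char *s, int *physical, int *logical)
{
    *physical = *logical = 0;
    for (const char *p = s; *p; p++) {
        if (*p == '\n') {
            (*physical)++;
            if (p == s || p[-1] != '\\')
                (*logical)++;          /* only unescaped newlines count */
        }
    }
}
```
-
- For the #define above the two counts differ by exactly 3, matching the
- drift in the error messages.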
-
- 8. The Purdue Computer Science Department maintains a WWW directory
- which includes a pccts page:
-
- URL http://tempest.ecn.purdue.edu:8001/
-
-
- 9. Suppose one wants to parse files that "include" other files. The
- code in ANTLR (antlr.g) for handling #tokdefs statements demonstrates
- how this may be done.
-
- grammar: ...
-
- | "#tokdefs" QuotedTerm
-
- <<{
-
- zzantlr_state st; /* defined in antlr.h */
- struct zzdlg_state dst; /* defined in dlgdef.h */
- FILE *f;
-
- UserTokenDefsFile = mystrdup(LATEXT(1));
- zzsave_antlr_state(&st);
- zzsave_dlg_state(&dst);
- f = fopen(StripQuotes(LATEXT(1)),"r");
- if ( f==NULL ) {
- warn(eMsg1("cannot open token defs file '%s'",
- LATEXT(1)+1));}
- else {
- ANTLRm( enum_file(), f, PARSE_ENUM_FILE);
- UserDefdTokens = 1;
- }
- zzrestore_antlr_state(&st);
- zzrestore_dlg_state(&dst);
- }>>
-
- The code uses zzsave_antlr_state() and zzsave_dlg_state() to save the state
- of the current parse. The ANTLRm macro specifies a starting rule for ANTLR
- of "enum_file" and starts DLG in the PARSE_ENUM_FILE state rather than the
- default state (which is the current state - whatever it might be). Because
- enum_file() is called without any arguments it appears that enum_file() does
- not use ASTs nor pass back any attributes. Contributed by Terence J. Parr.
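- The same save/restore idiom can be sketched in plain C without the pccts
- runtime. Here two globals stand in for the parser/scanner state, the
- "files" are hypothetical in-memory strings, and a line starting with "@"
- plays the role of the #tokdefs (or include) directive:
-
```c
#include <assert.h>
#include <string.h>

/* The save/restore idiom from antlr.g in miniature. */

struct state { const char *pos; int line; };

static const char *g_pos;   /* stands in for DLG's input position */
static int         g_line;  /* stands in for zzline               */

static void save_state(struct state *st)          /* zzsave_antlr_state()    */
{ st->pos = g_pos; st->line = g_line; }

static void restore_state(const struct state *st) /* zzrestore_antlr_state() */
{ g_pos = st->pos; g_line = st->line; }

static const char *lookup(const char *name)       /* toy "filesystem" */
{
    return strcmp(name, "inner") == 0 ? "b\nc\n" : NULL;
}

static void process(const char *text, char *out)  /* like ANTLRm(enum_file(),...) */
{
    g_pos = text;
    g_line = 1;
    while (g_pos && *g_pos) {
        const char *start = g_pos;
        const char *nl = strchr(g_pos, '\n');
        size_t len = nl ? (size_t)(nl - g_pos) : strlen(g_pos);
        g_pos += len + (nl != NULL);   /* consume the directive line first */
        g_line++;
        if (*start == '@') {           /* an "include" directive */
            char name[32];
            memcpy(name, start + 1, len - 1);
            name[len - 1] = '\0';
            struct state st;
            save_state(&st);           /* freeze the outer file...          */
            const char *inc = lookup(name);
            if (inc) process(inc, out);/* ...parse the inner one...         */
            restore_state(&st);        /* ...and resume where we left off   */
        } else {
            strncat(out, start, len);
            strcat(out, "\n");
        }
    }
}
```
-
- The point is the shape of the code, not the details: because the parser
- state lives in globals, it must be explicitly saved before the nested
- parse and restored afterwards.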
-
- 10. If an action becomes too large it will overflow an ANTLR buffer
- ("... error: action buffer overflow: size 4000"). If you don't wish to
- recompile ANTLR with a larger value for ZZLEXBUFSIZE you can put the
- action in an include file and then place a #include in the action.
- Suggested by David Seidel (dseidel@delphi.com).
-
- Splitting an action into two smaller actions will not work if the second
- action needs to refer to zzlextext.
-
- 11. When one is using multiple input files (for example "a.g" and "b.g"
- to generate "a.c" and "b.c") the only way to place file scope information
- in b.c is to place it in #header of the first grammar file. ANTLR won't
- allow file scope information to be copied from b.g into b.c using
- "<<...>>" notation. If one did place a file scope action in the b.g, ANTLR
- would try to interpret it as the fail action of the last rule appearing in
- a.g. (the first grammar file). The workaround is to #include b.c in
- another file which has your file scope declarations. You'll probably
- need to #include "stdpccts.h" before your file scope definitions.
-
- 12. Multiple parsers can coexist in the same application through use of
- the #parser directive (in C output mode). The #parser statement is not used
- with the ANTLR C++ output option because one can simply instantiate a new
- parser object.
-
-
- 13. Don't confuse #[...] with #(...).
-
- The first creates a single AST node (usually from a token identifier and
- an attribute) using the routine zzmk_ast(). The zzmk_ast() routine must be
- supplied by the user (or selected from one of the pccts supplied ones such
- as pccts/h/charbuf.h, pccts/h/charptr.*, and pccts/h/int.h).
-
- The second creates an AST list (usually more than a single node) from other
- ASTs by filling in the "down" field of the first node in the list to create
- a root node, and the "sibling" fields of each of the remaining ASTs in the
- list. A null pointer is put in the sibling field of the last AST in the
- list. This is performed by the pccts supplied routine zztmake().
-
- #token ID "[a-z]*"
- #token COLON ":"
- #token STMT_WITH_LABEL
-
- id! : ID <<#0=#[STMT_WITH_LABEL,$1];>> /* a1 */
-
- Creates an AST. The AST (a single node)
- contains STMT_WITH_LABEL in the token
- field - given a traditional version of
- zzmk_ast().
-
- rule! : id COLON expr /* a2 */
- <<#0=#(#1,#3);>>
-
- Creates an AST list with the ID at its
- root and "expr" as its first (and only) child.
-
- The following example (a3) is equivalent to a2, but more confusing because
- the two steps above have been combined into a single action statement:
-
- rule! : ID COLON expr
- <<#0=#(#[STMT_WITH_LABEL,$1],#3);>> /* a3 */
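- The two operators can be mimicked in plain C with a hand-rolled node type.
- Below, mk_node() plays the role of #[...] (zzmk_ast, minus the attribute)
- and tmake() the role of #(...) (zztmake); the field names follow the
- "down"/"sibling" description above. A sketch only - the real zztmake()
- also accepts a NULL root, which is omitted here:
-
```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdlib.h>

enum { ID = 1, COLON, STMT_WITH_LABEL };

typedef struct ast {
    int token;
    struct ast *down;      /* first child  */
    struct ast *sibling;   /* next sibling */
} AST;

static AST *mk_node(int token)                 /* #[token] : one node */
{
    AST *n = calloc(1, sizeof *n);
    n->token = token;
    return n;
}

static AST *tmake(AST *root, ...)              /* #(root, child, ..., NULL) */
{
    va_list ap;
    AST *child, *prev = NULL;
    va_start(ap, root);
    while ((child = va_arg(ap, AST *)) != NULL) {
        if (prev == NULL)
            root->down = child;                /* first child fills "down"   */
        else
            prev->sibling = child;             /* the rest chain as siblings */
        prev = child;
    }
    va_end(ap);                                /* last sibling stays NULL    */
    return root;
}
```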
-
- ===============================================================================
- Section on switches and options
- -------------------------------------------------------------------------------
- 14. Don't forget about the ANTLR -gd option which provides a trace of
- rules which are triggered and exited.
-
- The trace option can be useful in sometimes unexpected ways. For example,
- by suitably defining the macros zzTRACEIN and zzTRACEOUT before the
- #include of "antlr.h" one can accumulate information on how often each
- rule is invoked.
-
- 15. When you want to inspect the code generated by ANTLR you may want to
- use the ANTLR -gs switch. This causes ANTLR to test whether a token is
- an element of a lookahead set by using explicit tests with meaningful
- token names rather than by using the faster bit-oriented operations, which
- are difficult to read.
-
- 16. When using the ANTLR -gk option you probably want to use the DLG -i
- option. As far as I can tell neither option works by itself.
- Unfortunately they have different abbreviations so that one can't
- use the same symbol for both in a makefile.
-
- 17. When you are debugging code in the rule section and there is no
- change to the lexical scanner, you can avoid regeneration of scanner.c
- by using the ANTLR -gx option. However some items from stdpccts.h
- can affect the scanner, such as -k, -ck, and the addition of semantic
- predicates - so this optimization should be used with a little care.
-
- 18. One cannot use an interactive scanner (ANTLR -gk option) with the
- ANTLR infinite lookahead and backtracking options (syntactic predicates).
-
- 19. If you want backtracking, but not the prefetching of characters and
- tokens that one gets with lookahead, then you might want to try using
- your own input routine and then using ANTLRs (input supplied by string)
- or ANTLRf (input supplied by function) rather than plain ANTLR which
- is used in most of the examples.
-
- See Example 4 below for an example of an ANTLRf input function.
-
- 20. The format used in #line directive is controlled by the macro
-
- #define LineInfoFormatStr "# %d \"%s\"\n"
-
- which is defined in generic.h. A change requires recompilation of ANTLR.
-
- Prior to version 1.21 ANTLR -gl will sometimes place #line directives in
- a column other than column 1. The temporary workaround is to change
- the format string to:
-
- #define LineInfoFormatStr "\n# %d \"%s\"\n"
-
- This is reported as fixed in versions >= 1.21
-
- 21. To make the lexical scanner case insensitive use the DLG -ci
- switch. The analyzer does not change the text, it just ignores case
- when matching it against the regular expressions.
-
- The problem in version 1.10 with the -ci switch is fixed in versions >= 1.20.
-
-
- 22. In order to use a different name for the mode.h file it is necessary
- to supply the new name using both the ANTLR -fm switch and the DLG -m switch.
- ANTLR does not generate mode.h, but it does generate #include statements
- which reference it.
-
- ===============================================================================
- C++ Mode
- -------------------------------------------------------------------------------
- 23. When using backtracking (syntactic predicates) ANTLRtoken should be
- derived from ANTLRCommonBacktrackingToken rather than ANTLRCommonToken.
-
- ===============================================================================
- Section on #token, #tokclass, #tokdef #errclass (but not #lexclass)
- -------------------------------------------------------------------------------
- 24. To gobble up everything to a newline use: "~[\n]*".
-
- 25. To match any single character use: "~[]".
-
- 26. The char * array "zztokens" in err.c contains the text for the name of
- each token (indexed by the token number). This can be extremely useful
- for debugging and error messages.
-
- 27. If a #token symbol is spelled incorrectly in a rule it will not be
- reported by ANTLR unless the ANTLR -w2 option is set. ANTLR will assign
- it a new #token number which, of course, will never be matched. Look at
- token.h for misspelled terminals or inspect "zztokens[]" in err.c.
-
- 28. If you happen to define the same #token name twice (perhaps
- because of inadvertent duplication of a name) you will receive no
- error messages from ANTLR or DLG. ANTLR will simply use the later
- definition and forget the earlier one. Using the ANTLR -w2 option
- does not change this behavior.
-
- 29. One cannot continue a regular expression in a #token statement across
- lines. If one tries to use "\" to continue the line the lexical analyzer
- will think you are trying to match a newline character.
-
- 30. The escaped literals in #token regular expressions are not identical
- to the ANSI escape sequences. For instance "\v" will yield a match
- for "v", not a vertical tab.
-
- \t \n \r \b - the only escaped letters
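- A sketch of that table as a C helper (a hypothetical function, not part
- of DLG):
-
```c
#include <assert.h>

/* Only the four letters t, n, r, b map to control characters when
 * escaped in a #token regular expression; any other escaped letter
 * matches the bare letter itself. */
static char dlg_escape(char c)
{
    switch (c) {
    case 't': return '\t';
    case 'n': return '\n';
    case 'r': return '\r';
    case 'b': return '\b';
    default:  return c;        /* "\v" matches 'v', not a vertical tab */
    }
}
```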
-
- 31. In #token regular expressions spaces and tabs which are
- not escaped are ignored - thus making it easy to add white space to
- a regular expression.
-
- #token symbol "[a-z A-Z] [a-z A-Z 0-9]*"
-
- 32. You can achieve a limited form of one character lookahead in the
- #token statement action by using zzchar which contains the character following
- the regular expression just recognized.
-
- 33. The regular expressions appearing in #errclass declarations must
- be unique.
-
-
- 34. You cannot supply an action (even a null action) for a #token
- statement without a regular expression. You'll receive the message:
-
- warning: action cannot be attached to a token name
- (...token name...); ignored
-
- This is a minor problem when the #token is created for use with
- attributes or AST nodes and has no regular expression:
-
- #token CAST_EXPR
- #token SUBSCRIPT_EXPR
- #token ARGUMENT_LIST
-
- <<
- ... Code related to parsing
- >>
-
- ANTLR assumes the code block is the action associated with the #token
- immediately preceding it. It is not obvious what the problem is because
- the line number referenced is the END of the code block (">>") rather
- than the beginning. My solution is to follow such #token statements
- with a #token which does have a regular expression (or a rule).
-
- 35. Since the lexical analyzer wants to find the longest possible string
- that matches a regular expression, it is probably best not to use expressions
- like "~[]*" which will gobble up everything to the end-of-file.
-
- 36. The lexical routines zzmode(), zzskip(), and zzmore() do NOT work like
- coroutines. Basically, all they do is set status bits or fields in a
- structure owned by the lexical analyzer and then return immediately. Thus it
- is OK to call these routines anywhere from within a lexical action. You
- can even call them from within a subroutine called from a lexical action
- routine.
-
- See Example 5 below for routines which maintain a stack of modes.
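- The non-coroutine nature of zzmode() is exactly what makes a mode stack
- easy: pushMode()/popMode() just record numbers. In this self-contained
- sketch zzmode() is stubbed out with an assignment; in real pccts code the
- stub would call zzmode() instead. The #lexclass numbers are hypothetical:
-
```c
#include <assert.h>

/* A minimal mode stack in the spirit of Example 5. */

enum { START, COMMENT, STRINGS };     /* hypothetical #lexclass numbers */

static int cur_mode = START;          /* stands in for DLG's current mode */
static int mode_stack[32];
static int mode_sp = 0;

static void zzmode_stub(int m) { cur_mode = m; }  /* real code: zzmode(m) */

static void pushMode(int m)
{
    mode_stack[mode_sp++] = cur_mode; /* remember where to come back to */
    zzmode_stub(m);
}

static void popMode(void)
{
    zzmode_stub(mode_stack[--mode_sp]);
}
```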
-
-
- 37. When a string is matched by two #token regular expressions of equal
- length, the lexical analyzer will choose the one which appears first in
- the source code. Thus more specific regular expressions should appear
- before more general ones:
-
- #token HELP "help" /* should appear before "symbol" */
- #token symbol "[a-zA-Z]*" /* should appear after keywords */
-
- Some of these may be caught by using the DLG switch -Wambiguity.
- Consider the following grammar:
-
- #header <<
- #include "charbuf.h"
- >>
- <<
- int main() {
- ANTLR (statement(),stdin);
- return 0;
- }
- >>
-
- #token WhiteSpace "[\ \t]" <<zzskip();>>
- #token ID "[a-z A-Z]*"
- #token HELP "HELP"
-
- statement
- : HELP "@" <<printf("token HELP\n");>> /* a1 */
- | "inline" "@" <<printf("token inline\n");>> /* a2 */
- | ID "@" <<printf("token ID\n");>> /* a3 */
- ;
-
- Is an in-line regular expression treated any differently than a regular
- expression appearing in a #token statement? No! ANTLR/DLG does *NOT*
- check for a match to "inline" (line a2) before attempting a match to the
- regular expressions defined by #token statements. The first two
- alternatives ("a1" and "a2") will NEVER be matched. All of this will be
- clear from examination of the file "parser.dlg".
-
- If one builds this example using the DLG switch -Wambiguity one gets
- the message:
-
- dlg warning: ambigious regular expression 3 4
- dlg warning: ambigious regular expression 3 5
-
- The numbers which appear in the DLG message refer to the assigned token
- numbers. Examine the array zztokens[] in err.c to find the regular
- expression which corresponds to the token number reported by DLG.
-
- ANTLRChar *zztokens[6]={
- /* 00 */ "Invalid",
- /* 01 */ "@",
- /* 02 */ "WhiteSpace",
- /* 03 */ "ID",
- /* 04 */ "HELP",
- /* 05 */ "inline"
- };
-
- One can also look at the file "scan.c" in which action 4 would
- appear in the function "static void act4() {...}".
-
- The best advice is to follow the example of the Master, TJP, and place
- things like #token ID at the end of the grammar file.
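- The tie-breaking rule can be illustrated with a plain C stand-in for the
- generated scanner (a sketch, not DLG output). Both "help" and "[a-z]*"
- match all of "help"; the classifier checks the patterns in #token
- declaration order, so HELP wins, while a longer word is matched only by
- the more general pattern and comes back as ID:
-
```c
#include <assert.h>
#include <string.h>
#include <ctype.h>

enum { NONE, HELP, ID };

/* Patterns are tried in #token declaration order: the keyword first,
 * then the general identifier pattern. */
static int classify(const char *word)
{
    if (strcmp(word, "help") == 0)           /* #token HELP "help"  */
        return HELP;
    for (const char *p = word; *p; p++)      /* #token ID  "[a-z]*" */
        if (!islower((unsigned char)*p))
            return NONE;
    return *word ? ID : NONE;
}
```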
-
-
- 38. The DLG lexical analyzer is not able to backtrack. Consider the
- following example:
-
- #token "[\ \t]*" <<zzskip();>>
- #token ELSE "else"
- #token ELSEIF "else [\ \t]* if"
- #token STOP "stop"
-
- with input:
-
- else stop
-
- When DLG gets to the end of "else" it realizes that the spaces will allow
- it to match a longer string than "else" by itself. So DLG starts to accept
- the spaces. When DLG gets to the initial "s" in "stop" it realizes it has
- gone too far - but it can't backtrack. It passes back an error status to
- ANTLR which (normally) prints out something like:
-
- invalid token near line 1 (text was 'else ') ...
-
- There is an "extra" space between the "else" and the closing single quote
- mark.
-
- The fix for this particular example was to make ELSE accept the trailing
- spaces.
-
- #token "[\ \t]*" <<zzskip();>>
- #token ELSE "else [\ \t]*"
- #token ELSEIF "else [\ \t]* if"
-
- In other cases you may want to use #lexclass. You'll need to eliminate
- trailing space in the action part of the #token statement or have an
- attribute processing routine that can tolerate trailing spaces. In many
- cases the actual text of the reserved words is unimportant since only the
- token number is used.
-
- This problem is not detected by the DLG option -Wambiguity.
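- The failure can be simulated with a hand-rolled, non-backtracking
- automaton for just these two tokens (a sketch, not DLG output). Consumed
- characters cannot be pushed back, so when the scan gets stuck it succeeds
- only if it is *currently* in an accepting state. The else_eats_spaces
- flag corresponds to rewriting ELSE as "else [\ \t]*":
-
```c
#include <assert.h>
#include <string.h>

enum { ERR = -1, T_ELSE = 1, T_ELSEIF = 2 };

static int scan(const char *s, int else_eats_spaces, int *consumed)
{
    if (strncmp(s, "else", 4) != 0) { *consumed = 0; return ERR; }
    int n = 4;
    int acc = T_ELSE;                      /* "else" by itself is a token */
    while (s[n] == ' ' || s[n] == '\t') {  /* greedily reach for ELSEIF   */
        n++;
        acc = else_eats_spaces ? T_ELSE : 0;   /* stuck here otherwise    */
    }
    if (s[n] == 'i' && s[n + 1] == 'f') {
        n += 2;
        acc = T_ELSEIF;
    }
    *consumed = n;                /* eaten characters cannot be un-read   */
    return acc ? acc : ERR;
}
```
-
- With the original definitions the scan of "else stop" fails after eating
- five characters ("else " - note the trailing blank, just as in the error
- message above); with the fix it returns T_ELSE.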
-
- 39. In converting a long list of tokens appearing in a rule to use #tokclass
- I simply replaced the rule, in situ, with the #tokclass directive and did a
- global replace of the rule name with a new name in which the first letter
- was capitalized. It took me a while to realize that the ANTLR message:
-
- xxx.g, line 123: warning: redefinition of tokclass or conflict
- w/token 'Literal'; ignored
-
- meant that I had used the #tokclass "Literal" before it was defined.
- Only rules, not tokens, can be used in forward references. The problem
- was fixed by moving the #tokclass statement up to the #token section of
- the file.
-
- 40. The char * variables zzbegexpr and zzendexpr point to the start and end
- of the string last matched by a regular expression in a #token statement.
-
- However, the char array pointed to by zzlextext may be larger than the
- string pointed to by zzbegexpr and zzendexpr because it includes substrings
- accumulated through the use of zzmore().
-
-
- 41. The preprocessor symbol ZZCOL in the lexical scanner controls the
- update of column information. This doesn't cause the zzsyn() routine to
- report the position of tokens causing the error. You'll still have to
- write that yourself. The problem, I think, is that, due to look-ahead,
- the value of zzendcol will not be synchronized with the token causing the
- error, so that the problem becomes non-trivial.
-
- 42. If you want to use ZZCOL to keep track of the column position
- remember to adjust zzendcol in the lexical action when a character is not
- one print position wide (e.g. tabs or non-printing characters).
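- A sketch of the adjustment, assuming 1-based columns and tab stops every
- 8 positions (both are assumptions of this example, not anything pccts
- mandates):
-
```c
#include <assert.h>

/* How a zzendcol-style counter might advance for characters that are
 * not one print position wide. */
static int advance_col(int col, char c)
{
    if (c == '\t')
        return ((col - 1) / 8 + 1) * 8 + 1;  /* jump to the next tab stop  */
    if (c == '\n')
        return 1;                            /* new line: back to column 1 */
    return col + 1;                          /* ordinary printing character */
}
```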
-
- 43. In version 1.00 it was common to change the token code based on
- semantic routines in the #token actions. With the addition of semantic
- predicates in 1.06 this technique is now frowned upon.
-
- Old style:
-
- #token TypedefName
- #token ID "[a-z A-Z]*"
- <<{if (isTypedefName(LATEXT(1))) NLA=TypedefName;};>>
-
- New Style:
-
- #token ID "[a-z A-Z]*"
-
- typedefName : <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1>>? ID;
-
- See the section on semantic predicates for a longer explanation.
-
-
- 44. DLG has no operator like grep's "^" which anchors a pattern to the
- beginning of a line. One can use tests based on zzbegcol only if column
- information is selected (#define ZZCOL) AND one is NOT using infinite
- lookahead mode (syntactic predicates). A technique which does not depend
- on zzbegcol is to look for the newline character and then enter a special
- #lexclass.
-
- Consider the problem of recognizing lines which have a "!" as the first
- character of a line. A possible solution suggested by Doug Cuthbertson
- is:
-
- #token "\n" <<zzline++; zzmode(BEGIN_LINE);>>
-
- *** or ***
-
- #token "\n" <<zzline++;
- if (zzchar=='!') zzmode(BEGIN_LINE);>>
-
- #lexclass BEGIN_LINE
- #token BANG "!" <<zzmode(START);>>
- #token "~[]" <<zzmode(START); zzmore();>>
-
- When a newline is encountered the #lexclass BEGIN_LINE is entered. If
- the next character is a "!" it returns the token "BANG" and returns
- to #lexclass START. If the next character is anything else it calls
- zzmore to accumulate additional characters for the token and, as before,
- returns to #lexclass START. (The order of calls to zzmode() and zzmore()
- is not significant).
-
- There are two limitations to this.
-
- a. If there are other single character tokens which can appear in the first
- column then using zzmore() won't be sufficient to work around the problem
- because the entire (one character) token has already been consumed. Thus
- all single character tokens which can appear in column 1 must appear in
- both #lexclass START and #lexclass BEGIN_LINE.
-
- b. The first character of the first line is not preceded by a newline,
- so DLG will start in the wrong state. You might therefore want to rename
- "BEGIN_LINE" to "START" and "START" to "NORMAL".
-
- Another solution is to use ANTLRf (input from a function) to insert
- your own function to do the kind of lexical processing which is difficult
- to express in DLG.
-
- In 1.20 the macro ANTLRm was added. It is similar to ANTLR, but has an
- extra argument which allows one to specify the lexical class which is
- passed to zzmode() to specify the initial #lexclass state of DLG.
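- The two-#lexclass idea can be mimicked in a few lines of plain C. The
- scanner below starts in BEGIN_LINE (side-stepping limitation "b"), drops
- back to START after one character, and re-enters BEGIN_LINE after each
- newline; here it merely counts the BANG tokens:
-
```c
#include <assert.h>

enum { M_START, M_BEGIN_LINE };

/* '!' is a BANG token only while in BEGIN_LINE mode, i.e. only when it
 * is the first character of a line. */
static int count_bang_lines(const char *s)
{
    int mode = M_BEGIN_LINE;           /* input begins at the start of a line */
    int bangs = 0;
    for (; *s; s++) {
        if (mode == M_BEGIN_LINE && *s == '!')
            bangs++;                   /* BANG: '!' in column 1 only     */
        mode = (*s == '\n') ? M_BEGIN_LINE   /* like zzmode(BEGIN_LINE)  */
                            : M_START;       /* like zzmode(START)       */
    }
    return bangs;
}
```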
-
- 45. In version 1.10 there were problems using 8 bit characters with DLG.
- Versions >= 1.20 of ANTLR/DLG work with 8 bit character sets when they are
- compiled in a mode in which char variables are by default unsigned (the
- g++ option "-funsigned-char"). This should be combined with a call to
- setlocale (LC_ALL,"") to replace the default locale of "C" with the user's
- native locale. This is system dependent - it works with Unix systems but
- not DOS. Contributed by Ulfar Erlingsson (ulfarerl@rhi.hi.is).
-
- See Example 4 below.
-
- ===============================================================================
- Section on #lexclass
- -------------------------------------------------------------------------------
- 46. Special care should be taken when using "in-line" regular expressions
- in rules if there are multiple lexical classes (#lexclass). ANTLR will
- place such regular expressions in the last lexical class defined. If
- the last lexical class was not START you may be surprised.
-
- #lexclass START
- ....
- #lexclass COMMENT
- ....
-
- inline_example: symbol "=" expression
-
- This will place "=" in the #lexclass COMMENT (where
- it will never be matched) rather than the START #lexclass
- where the user meant it to be.
-
- Since it is okay to specify parts of the #lexclass in several pieces
- it might be a good idea when using #lexclass to place "#lexclass START"
- just before the first rule - then any in-line definitions of tokens
- will be placed in the START #lexclass automatically.
-
- #lexclass START
- ...
- #lexclass A
- ...
- #lexclass B
- ...
- #lexclass START
-
- 47. Good examples of the use of #lexclass are the definitions for C
- and C++ style comments, character literals, and string literals which
- can be found in pccts/lang/C/decl.g - or see Example 1 below.
-
- 48. The initial #lexclass of DLG is set by a data statement to START
- (which is 0). Unlike ANTLRm, the traditional ANTLR macros (ANTLRf, ANTLRs,
- and ANTLR) do NOT reset the #lexclass. If you call ANTLR multiple times
- during a program (for instance to parse each statement of a line-oriented
- language independently) DLG will resume in the #lexclass that it was in
- when ANTLR returned. If you want to restart DLG in the START state you
- should precede the call to ANTLR with
-
- zzmode(START);
- or use:
- ANTLRm (myStartRule(),myStartMode);
-
-
- 49. Consider the problem of a grammar in which a statement is composed
- of clauses, each of which has its own #lexclass and in which a given
- word is "reserved" in some clauses and not others:
-
- #1;1-JAN-94 01:23:34;enable;a b c d;this is a comment;
- #2;1-JAN-94 08:01:56;operator;smith;move to another station;
- #3;1-JAN-94 09:10:11;move;old pos=5.0 new pos=6.0;operator request;
- #4;1-JAN-94 10:11:12;set-alarm;beeper;2-JAN-94 00:00:01;
-
- One would like to reuse a #lexclass if possible. There is no problem with
- maintaining a stack of modes (#lexclass numbers) and pushing a new mode
- onto the stack each time a new #lexclass subroutine is called. How to do
- this is demonstrated in Example 5. The problem appears when it is
- necessary to leave a #lexclass and return more than one level. To be more
- specific, a #token action can only be executed when one or more characters
- are consumed - so to return through three levels of #lexclass calls would
- appear to require the consumption of at least three characters. In the
- case of balanced constructs like "...", and '...', or (...) this is not a
- problem since the terminating character can be used to trigger the #token
- action. However, if the scan is terminated by a "separator", such as the
- semi-colon above (";"), one cannot use the same technique. Once the
- semi-colon is consumed it is unavailable for the other #lexclass routines
- on the stack to see.
-
- My solution is to allow the user to specify (during the call to pushMode)
- a "lookahead" routine to be called when the corresponding element of the
- mode stack is popped. At that point the "lookahead" routine can examine
- zzchar to determine whether it also wants to pop the stack, and so on up
- the mode stack. The consumption of a single character can result in
- popping multiple modes from the mode stack based on a single character of
- lookahead. See the second part of Example 5 below.
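- That mode stack can be sketched in plain C: each pushMode() may record a
- predicate which, when its stack element is popped, examines the lookahead
- character (zzchar in real code) and decides whether the stack should pop
- again, cascading upward. The field modes and pop_on_semi() below are
- hypothetical:
-
```c
#include <assert.h>
#include <stddef.h>

enum { START, DATE_FIELD, NAME_FIELD, COMMENT_FIELD };  /* hypothetical */

typedef int (*wants_pop_fn)(int lookahead);

static int cur_mode = START;
static struct { int mode; wants_pop_fn check; } stk[32];
static int sp = 0;

static int pop_on_semi(int c) { return c == ';'; }

static void pushMode(int m, wants_pop_fn check)
{
    stk[sp].mode = cur_mode;          /* mode to restore on pop            */
    stk[sp].check = check;            /* consulted when this element pops  */
    sp++;
    cur_mode = m;
}

static void popMode(int lookahead)    /* one char can unwind several modes */
{
    for (;;) {
        sp--;
        cur_mode = stk[sp].mode;
        if (sp == 0 || stk[sp].check == NULL || !stk[sp].check(lookahead))
            break;
    }
}
```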
-
- Continuing with the example of the log file (above): each statement type
- has its fields in a specific order. When the statement type is recognized
- a pointer is set to a list of the #lexclasses which is in the same order as
- the remaining fields of that kind of statement. An action attached to
- every #token which recognizes a semi-colon (";") advances a pointer in
- the list of #lexclasses and then changes the #lexclass by calling zzmode()
- to set the #lexclass for the next field of the statement.
-
- ===============================================================================
- Section on rules
- -------------------------------------------------------------------------------
- 50. If a rule is not used (is an orphan) it can lead to unanticipated
- reports of ambiguity. Use the ANTLR cross-reference option (-cr) to
- locate rules which are not referenced. Not verified in version 1.20.
-
- 51. ANTLR attempts to deduce "start" rules by looking for rules which
- are not referenced by any other rules. When it finds such a rule it
- assumes that an EOF token ("@") should be there and adds one if the
- user did not code one. This is the only case, according to TJP, when
- ANTLR adds something to the user's grammar.
-
- 52. To express the idea "any single token is acceptable at this point"
- use the "." token wild-card. This can be very useful for providing a
- context dependent error message, rather than the all purpose message
- "syntax error".
-
- if-stmt : IF "\(" expr "\)" stmt
-         | IF . <<printf("If statement requires expression "
-                         "enclosed in parentheses\n");
-                  PARSE_FAIL;
-                >>
-
- It is probably best not to use expressions such as:
-
- ignore: (.)* /* Not a good idea */
-
- which will gobble up everything to the end-of-file.
-
- 53. New to version 1.20 is the "~" operator for tokens. It allows
- one to specify tokens which must NOT match in order to match a rule.
-
- The "~" operator cannot be applied to rules. To express the idea
- "if this rule doesn't match try to match this other rule" use
- syntactic predicates.
-
- 54. Some constructs which are bound to cause warnings about
- ambiguities:
-
- rule : a { ( b | c )* };
-
- rule : a { b };
- b : ( c )*;
-
- rule : a c*;
- a : b { c };
-
- rule : a { b | c | };
-
-
- 55. Don't confuse init-actions with actions which precede a rule. If the
- first element following the start of a rule or sub-rule is an action it is
- always interpreted as an init-action.
-
- An init-action occurs in a scope which includes the entire rule or sub-rule.
- An action which is NOT an init-action is enclosed in "{" and "}" during
- generation of code for the rule and has essentially zero scope - the
- action itself.
-
- The difference between an init-action and an action which precedes a rule
- can be especially confusing when an action appears at the start of an
- alternative:
-
- These APPEAR to be almost identical, but they aren't:
-
- b : <<int i=0;>> b1 > [i] /* b1 <<...>> is an init-action */
- | <<int j=0;>> b2 > [j] /* b2 <<...>> is part of the rule */
- ; /* and will cause a compilation error */
-
- On line "b1" the <<...>> appears immediately after the beginning of the
- rule making it an init-action. On line "b2" the <<...>> does NOT appear at
- the start of a rule or sub-rule, thus it is interpreted as an action which
- happens to precede the rule.
-
- This can be especially dangerous if you are in the habit of rearranging
- the order of alternatives in a rule. For instance:
-
- Changing this:
-
- b : <<int i=0,j=0;>> <<i++;>> b1 > [i] /* c1 */
- | <<j++;>> b1 > [i] /* c2 */
- ;
-
- to:
-
- b : /* empty production */ /* d1 */
- | <<int i=0,j=0;>> <<i++;>> b1 > [i] /* d2 */
- | <<j++;>> b1 > [i]
- ;
-
- or to this:
-
- b
- : <<j++;>> b1 > [i] /* e1 */
- | <<int i=0,j=0;>> <<i++;>> b1 > [i] /* e2 */
-
- changes an init-action into a non-init action, and vice-versa.
-
-
- 56. A particularly nasty form of the init-action problem is when
- an empty sub-rule has an associated action:
-
- rule!: ID (
- /* empty */
- <<#0=#[ID,$1.1];>>
- | array_bounds
- <<#0=#(#[T_array_declaration,$1.1],#1);>>
- )
- ;
-
- Since there is no reserved word in pccts for epsilon, the action
- for the empty arm of the sub-rule becomes the init-action. For
- this reason it's wise to follow one of two conventions:
- (1) represent epsilon with an empty sub-rule "()", or (2) put the
- empty alternative last in the list of alternatives:
-
- rule!: ID (
- () <<#0=#[ID,$1.1];>>
- | array_bounds
- <<#0=#(#[T_array_declaration,$1.1],#1);>>
- )
- ;
-
- The cost of using "()" to represent epsilon is the execution of the macro
- zzBLOCK() at the start of the sub-rule and zzEXIT() at the end of the
- sub-rule. Macro zzBLOCK() creates a temporary stack pointer for the
- attribute stack and checks for overflow. Macro zzEXIT() pops any
- attributes that might have been placed on attribute stack. Since no
- attribute stack operations take place for epsilon this is wasted CPU
- cycles, however this is probably not a significant cost for many users.
-
- 57. Another form of problem caused by init-action occurs when one
- comments out a rule in the grammar in order to test an idea:
-
- rule /* a1 */
- : <<init-action;>> /* a2 */
- //// rule_a /* a3 */
- | rule_b /* a4 */
- | rule_c /* a5 */
-
- In this case one only wanted to comment out the "rule_a" reference
- in line "a3". The reference is indeed gone, but the change has
- introduced an epsilon production - which probably creates a large
- number of ambiguities. Without the init-action the ":" would have
- probably have been commented out also, and ANTLR would report a
- syntax error - thus preventing one from shooting oneself in the foot.
-
- 58. In the case of sub-rules such as (...)+, (...)*, and {...} the
- init-action is executed just once before the sub-rule is entered.
- Consider the following example from section 3.6.1 (page 29) of the 1.00
- manual:
-
- a : <<List *p=NULL;>> // initialize list
- Type
- ( <<int i=0;>> // initialize index
- Var <<append(p,i++,$1);>>
- )*
- <<OperateOn(p);>>
- ;
-
-
- 59. Associativity and precedence of operations is determined by
- nesting of rules. In the example below "=" associates to the right
- and has the lowest precedence. Operators "+" and "*" associate to
- the left with "*" having the highest precedence.
-
- expr0 : expr1 {"=" expr0};
- expr1 : expr2 ("\+" expr2)*;
- expr2 : expr3 ("\*" expr3)*;
- expr3 : ID;
-
- See Example 2.
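- The nesting scheme can be hand-transcribed as a tiny evaluator over
- single-digit operands (a sketch of the kind of recursive-descent code
- ANTLR generates, with "=" left out so a value can be computed). Each
- rule loops over its own operator; "*" binds tighter than "+" simply
- because expr2 sits one level deeper in the call chain:
-
```c
#include <assert.h>

static const char *p;                         /* expression cursor */

static int expr3(void) { return *p++ - '0'; } /* expr3 : ID        */

static int expr2(void)                        /* expr2 : expr3 ("\*" expr3)* */
{
    int v = expr3();
    while (*p == '*') { p++; v *= expr3(); }  /* loop => left associative */
    return v;
}

static int expr1(void)                        /* expr1 : expr2 ("\+" expr2)* */
{
    int v = expr2();
    while (*p == '+') { p++; v += expr2(); }
    return v;
}

static int eval(const char *s) { p = s; return expr1(); }
```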
-
- 60. Fail actions for a rule can be placed after the final ";" of
- a rule. These will be:
-
- "executed after a syntax error is detected but before
- a message is printed and the attributes have been destroyed.
- However, attributes are not valid here because one does not
- know at what point the error occurred and which attributes
- even exist. Fail actions are often useful for cleaning up
- data structures or freeing memory."
-
- (Page 29 of 1.00 manual)
-
- Example of a fail action:
-
- a : <<List *p=NULL;>>
- ( Var <<append(p,$1);>> )+
- <<operateOn(p);rmlist(p);>>
- ; <<rmlist(p);>>
- ************** <--- Fail Action
-
- 61. When you have rules with large amounts of lookahead (that may
- cross several lines) you can use the ANTLR -gk option to make an
- ANTLR-generated parser delay lookahead fetches until absolutely
- necessary. To get better line number information (e.g. for error
- messages or #line directives) place an action which will save
- "zzline" in a variable at the start of the production where you
- want better line number information:
-
- a : <<int saveCurrentLine;>>
- <<saveCurrentLine = zzline;>> A B C
- << /* use saveCurrentLine not zzline here */ >>
- | <<saveCurrentLine = zzline;>> A B D
- << /* use saveCurrentLine not zzline here */ >>
- ;
-
- After the production has been matched you can use saveCurrentLine
- rather than the bogus "zzline".
-
- Contributed by Terence "The ANTLR Guy" Parr (parrt@acm.org)
-
- In version 1.20 a new macro, ZZINF_LINE(), was added to extract line
- information in a manner similar to LATEXT when using infinite lookahead
- mode. See the page 6 of the 1.20 release notes for more information.
- There is nothing like ZZINF_COL() for column information, but it should
- be easy to create using ZZINF_LINE() as a model.
-
- 62. An easy way to get a list of the names of all the rules is
- to grep tokens.h for the string "void" or edit the output from ANTLR
- run with the -cr option (cross-reference).
-
-
- 63. It took me a while to understand in an intuitive way the difference
- between full LL(k) lookahead given by the ANTLR -k switch and the
- linear approximation given by the ANTLR -ck switch. This was in spite
- of the example given in section 5 (pages 18 to 21) of the 1.10 release notes.
-
- Most of the time I run ANTLR with -k 1 and -ck 2. Because I didn't
- understand the linear approximation I didn't understand the warnings about
- ambiguity. I couldn't understand why ANTLR would complain about something
- which I thought was obviously parse-able with the lookahead available.
- Was it a bug or was it me ? I would try to make the messages go away
- totally, which was sometimes very hard. If I had understood the linear
- approximation I might have been able to fix them easily or at least have
- realized that there was no problem with the grammar, just with the
- limitations of the linear approximation.
-
- I will restrict the discussion to the case of "-k 1" and "-ck 2".
-
- Consider the following example:
-
- rule1 : rule2a | rule2b | rule2c ;
- rule2a : A X | B Y | C Z ;
- rule2b : B X | B Z ;
- rule2c : C X ;
-
- It should be clear that with the sentence being only two tokens this
- should be parseable with LL(2).
-
- Instead, because k=1 and ck=2 ANTLR will produce the following messages:
-
- /pccts120/bin/antlr -k 1 -gs -ck 2 -gh example.g
- Antlr parser generator Version 1.20 1989-1994
- example.g, line 23: warning: alts 1 and 2 of the rule itself
- ambiguous upon { B }, { X Z }
- example.g, line 23: warning: alts 1 and 3 of the rule itself
- ambiguous upon { C }, { X }
-
- The code generated resembles the following:
-
- if (LA(1)==A || LA(1)==B || LA(1)==C) &&
- (LA(2)==X || LA(2)==Y || LA(2)==Z) then rule2a()
-
- else if (LA(1)==B) &&
- (LA(2)==X || LA(2)==Z) then rule2b()
-
- else if (LA(1)==C) &&
- (LA(2)==X) then rule2c()
- ...
-
- This might be called "product-of-sums". There is an "or" part for
- LA(1), an "or" part for LA(2), and they are combined using "and".
- To match, the first lookahead token must be in the first set and the second
- lookahead token must be in the second set. It doesn't matter that what
- one really wants is:
-
- if (LA(1)==A && LA(2)==X) ||
- (LA(1)==B && LA(2)==Y) ||
- (LA(1)==C && LA(2)==Z) then rule2a()
-
- else if (LA(1)==B && LA(2)==X) ||
- (LA(1)==B && LA(2)==Z) then rule2b()
-
- else if (LA(1)==C && LA(2)==X) then rule2c()
-
- This form is "sum-of-products", but the real problem is that each
- product involves one element from LA(1) and one from LA(2) and as the
- number of possible tokens increases the number of terms grows as N**2.
- With the linear approximation the number of terms grows (surprise)
- linearly in the number of tokens.
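-
- A small C sketch makes the difference concrete (the token names and
- helper functions are invented for this illustration, not pccts code):
- the approximate test accepts the pair (B,X) for rule2a even though no
- alternative of rule2a matches it.

```c
#include <assert.h>

enum Tok { A, B, C, X, Y, Z };

/* linear approximation: one set test per lookahead depth
   ("product of sums") for rule2a : A X | B Y | C Z ; */
static int approx_rule2a(enum Tok la1, enum Tok la2)
{
    return (la1 == A || la1 == B || la1 == C) &&
           (la2 == X || la2 == Y || la2 == Z);
}

/* full LL(2): one term per valid token pair ("sum of products") */
static int exact_rule2a(enum Tok la1, enum Tok la2)
{
    return (la1 == A && la2 == X) ||
           (la1 == B && la2 == Y) ||
           (la1 == C && la2 == Z);
}
```

- Every pair accepted by exact_rule2a() is also accepted by
- approx_rule2a(), but not vice versa - the approximation never selects
- a subset of the correct lookahead.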
-
- ANTLR won't do this with k=1; it will only do "product-of-sums". However,
- all is not lost - you simply add a few well chosen semantic predicates
- which you have computed using your LL(k>1), all purpose, carbon based,
- analog computer.
-
- The linear approximation selects for each branch of the "if" a set which
- MAY include more than what is wanted. It never selects a subset of the
- correct lookahead sets! We simply insert a hand-coded version of the
- LL(2) computation. It's ugly, especially in this case, but it fixes the
- problem. In large grammars it may not be possible to run ANTLR with k=2,
- so this fixes a few rules which cause problems. The generated parser may
- run faster because it will have to evaluate fewer terms at execution time.
-
- <<
- int bypass_rule2a() {
- if ( LA(1)==B && LA(2)==Y ) return 0;
- if ( LA(1)==B ) return 1;
- if ( LA(1)==C && LA(2)==X ) return 1;
- return 0;
- }
- >>
-
- rule1 :
- <<!bypass_rule2a()>>? rule2a | rule2b | rule2c ;
- rule2a : A X | B Y | C Z ;
- rule2b : B X | B Z ;
- rule2c : C X ;
-
- The real cases I've coded have shorter code sequences in the semantic
- predicate. I coded this as a function to make it easier to read and
- because there is a bug in 1.1x and 1.2x which prevents semantic predicates
- from crossing lines. Another reason to use a function (or macro) is to
- make it easier to read the generated code to determine when your semantic
- predicate is being hoisted too high (it's easy to find references to a
- function name with the editor - but difficult to locate a particular
- sequence of "LA(1)" and "LA(2)" tests). Predicate hoisting is a separate
- issue which is described elsewhere in this note.
-
- In some cases of reported ambiguity it is not necessary to add semantic
- predicates because no VALID token sequence could get to the wrong rule.
- If the token sequence were invalid it would be detected by the grammar
- eventually, although perhaps not where one might wish. In other cases
- the only necessary action is a reordering of the ambiguous rules so
- that a more specific rule is tested first. The error messages still
- appear, but one can ignore them or place a trivial semantic predicate
- (i.e. <<1>>? ) in front of the later rules. This makes ANTLR happy
- because it thinks you've added a semantic predicate which fixes things.
-
- Some constructs just invite problems. For instance in C++ with a suitable
- definition of the class "C" one can write:
-
- C a,b,c; /* a1 */
- a.func1(b); /* a2 */
- a.func2()=c; /* a3 */
- a = b; /* a4 */
- a.operator =(b); /* a5 */
-
- Statement a5 happens to place an "=" (or any of the usual C++ operators)
- in a token position where it can cause a lot of ambiguity in the lookahead
- set. I eventually solved this particular problem by creating a special
- #lexclass for things which follow "operator". I use an entirely different
- token number for such operators - thereby avoiding the whole problem.
-
- //
- // C++ operator sequences
- //
- // operator <type_name>
- // operator <special characters>
- //
- // There must be at least one non-alphanumeric character between
- // "operator" and operator name - otherwise they would be run
- // together - ("operatorint" instead of "operator int")
- //
-
- #lexclass LEX_OPERATOR
- #token FILLER_C1 "[\ \t]*"
- <<zzskip();
- if( isalnum(zzchar) ) zzmode(START);
- >>
- #token OPERATOR_STRING "[\+\-\*\/\%\^\&\|\~\!\=\<\>]*"
- <<zzmode(START);>>
- #token FILLER_C2 "\(\) | \[\] "
- <<NLA=OPERATOR_STRING;zzmode(START);>>
-
- ===============================================================================
- Section on Attributes
- -------------------------------------------------------------------------------
- 64. Attributes are built automatically only for terminals. For
- rules (non-terminals) one must assign an attribute to $0, use the
- $[token,...] convention for creating attributes, or use zzcr_attr().
-
- 65. The way to access the text (or whatever) part of an attribute
- depends on the way the attribute is stored.
-
- If one uses the pccts supplied routine "pccts/h/charbuf.h" then
-
- id : "[a-z]+" <<printf("Token is %s\n",$1.text);>>
-
- If one uses the pccts supplied routine "pccts/h/charptr.c" and
- "pccts/h/charptr.h" then:
-
- id : "[a-z]+" <<printf("Token is %s\n",$1);>>
-
- If one uses the pccts supplied routine "pccts/h/int.h" (which
- stores numbers only) then:
-
- number : "[0-9]+" <<printf ("Token is %d\n",$1);>>
-
- Note the use of %d rather than %s in the printf() format.
-
- 66. The expression $$ refers to the attribute of the named rule.
- The expression $0 refers to the attribute of the enclosing rule,
- (which might be a sub-rule).
-
- rule : a b (c d (e f g) h) i
-
- For (e f g) $0 becomes $3 of (c d ... h). For (c d ... h) $0 becomes
- $3 of (a b ... i). However $$ always is equivalent to $rule.
-
- 67. If you define a zzcr_attr() or zzmk_attr() which allocates resources
- such as strings from the heap don't forget to define a zzd_attr() routine
- to release the resources when the attribute is deleted.
-
- 68. Attributes go out of scope when the rule or sub-rule that defines
- them is exited. Don't try to pass them to an outer rule or a sibling
- rule. The only exception is $0 which may be passed back to the containing
- rule as a return argument. However, if the attribute contains a pointer
- which is copied (e.g. pccts/h/charptr.c) then extra caution is required
- because of the actions of zzd_attr(). For C++ users this should be
- implemented in the class copy constructor. The version of pccts/h/charptr.*
- distributed with pccts does not use C++ features. See the next item for
- more information.
-
- 69. The pccts/h/charptr.c routines use a pointer to a string. The string
- itself will go out of scope when the rule or sub-rule is exited. Why ?
- The string is copied to the heap when ANTLR calls the routine zzcr_attr()
- supplied by charptr.c - however ANTLR also calls the charptr.c supplied
- routine zzd_attr() (which frees the allocated string) as soon as the rule or
- sub-rule exits. The result is that in order to pass charptr.c strings to
- outer rules (for instance to $0) it is necessary to make an independent
- copy of the string using strdup or else zero the pointer to prevent its
- deallocation.
-
-
- 70. To initialize $0 of a sub-rule use a construct like the following:
-
- decl : typeID
- Var <<$2.type = $1;>>
- ( "," Var <<$2.type = $0;>>)*[$1]
- **** <--------------
-
- See section 4.1.6.1 (page 29) of the 1.00 manual
-
- 71. One can use the zzdef0() macro to define a standard method for
- initializing $0 of a rule or sub-rule. If the macro is defined it is
- invoked as zzdef0(&($0)).
-
- See section 4.1.6.1 (page 29) of the 1.00 manual
-
- I believe that for C++ users this would be handled by the class constructor.
-
- 72. If you construct temporary attributes in the middle of the
- recognition of a rule, remember to deallocate the structure should the
- rule fail. The code for failure goes after the ";" and before the next
- rule. For this reason it is sometimes desirable to defer some processing
- until the rule is recognized rather than the most convenient place.
-
- #include "pccts/h/charptr.h"
-
- ;statement!
- : <<char *label=0;>>
- {ID COLON <<label=MYstrdup($1);>> }
- statement_without_label
- <<#0=#(#[T_statement,label],#2);
- if (label!=0) free(label);
- // AST #1 is undefined
- // AST #2 is returned by
- // statement_without_label
- >>
- ;<<if (label !=0) free(label);>>
-
- In the above example attributes are handled by charptr.*. Readers of this
- note have been warned earlier about its dangers. The routine I have
- written to construct ASTs from attributes (invoked by #[int,char *]) knows
- about this behavior and automatically makes a copy of the character string
- when it constructs the AST. This makes the copy created by the explicit
- call to MYstrdup redundant once the AST has been constructed. If the call
- to "statement_without_label" fails then the temporary copy must be
- deallocated.
-
- ===============================================================================
- Section on ASTs
- -------------------------------------------------------------------------------
- 73. If you define a zzcr_ast() or zzmk_ast() which allocates resources
- such as strings from the heap don't forget to define a zzd_ast() routine
- to release the resources when the AST is deleted. For C++ users this
- should be implemented as part of the class destructor.
-
- 74. If you construct temporary ASTs in the middle of the recognition of a
- rule, remember to deallocate the structure should the rule fail. The code
- for failure goes after the ";" and before the next rule. For this reason
- it is sometimes desirable to defer some processing until the rule is
- recognized rather than the most appropriate place. For C++ users this
- might be implemented as part of the class destructor.
-
- If the temporary is an AST returned by a called rule then you'll probably
- have to call zzfree_ast() to release the entire AST tree. Consider
- the following example:
-
- obj_name! /* a1 */
- : <<AST *node=0;>> /* a2 */
- class_name <<node=#1;>> /* a3 */
- ( /* a4 */
- () /* empty */ /* a5 */
- <<#0=node;node=0;>> /* a6 */
- | COLON_COLON follows_dot_class[node] /* a7 */
- <<#0=#2;node=0;>> /* a8 */
- ) /* a9 */
- ......... /* a10 */
- /* a11 */
- ; <<if (node!=0) zzfree_ast(node);>> /* a12 */
-
- In this case "class_name" may return a full AST tree (not a trivial tree)
- because of information required to represent template classes (e.g.
- dictionary<int,1000> is a "class_name"). This tree ("node") is passed to
- another rule ("follows_dot_class") which uses it to construct another AST
- tree which incorporates it. If "follows_dot_class" succeeds then node is
- set to 0 (lines a6 or a8) because the tree is now referenced via #2. If
- "follows_dot_class" fails then the entire tree created by class_name must
- be deallocated (line a12). The temporary "node" must be used because there
- is no convenient way (such as #1.1) to refer to class_name from within the
- sub-rule.
-
- Please note the use of an empty sub-rule ("()" on line a5) to avoid the nasty
- init-action problem mentioned earlier.
-
- 75. Example 6 shows debugging code to help locate ASTs that were created
- but never deleted.
-
- 76. If you want to place prototypes for routines that have an AST
- as an argument in the #header directive you should explicitly
- #include "ast.h" after the #define AST_FIELDS and before any references
- to AST:
-
- #define AST_FIELDS int token;char *text;
- #include "ast.h"
- #define zzcr_ast(ast,attr,tok,astText) \
- create_ast(ast,attr,tok,text)
- void create_ast (AST *ast,Attr *attr,int tok,char *text);
-
-
- 77. The make-a-root operator for ASTs ("^") can be applied only to
- terminals. (This includes items identified in #token ,#tokclass, and
- #tokdef statements). I think this is because a child rule might return a
- tree rather than a single AST. If it did then it could not be made into a
- root as it is already a root and the corresponding fields of the structure
- are already in use. To make an AST returned by a called rule a root use
- the expression: #(root-rule sibling1 sibling2 sibling3).
-
- add ! : expr ("\+"^ expr) ; // Is ok
-
- addOperator ! : expr (addOp^ expr) ; // Is NOT ok
- addOp : "\+" | "\-"; //
-
- Example 2 describes a workaround for this restriction.
-
- 78. Because it is not possible to use an already constructed AST tree
- as the root of a new tree (unless it's a trivial tree with no children)
- one should be suspicious of any constructs like the following:
-
- rule! : ........ <<#0=#(#1,...)...;>>
- ** <=====================
-
- If #1 is a non-trivial tree its existing children will be lost when the
- new tree is constructed for assignment to #0.
-
- 79. Do not assign to #0 of a rule unless automatic construction of ASTs
- has been disabled using the "!" operator:
-
- a! : x y z <<#0=#(#1 #2 #3);>> // ok
- a : x y z <<#0=#(#1 #2 #3);>> // NOT ok
-
- The reason for the restriction is that assignment to #0 will cause any
- ASTs pointed to by #0 to be lost when the pointer is overwritten.
-
- The stated restriction is somewhat stronger than necessary. You can
- assign to #0 even when using automated AST construction, if the old
- tree pointed to by #0 is part of the new tree constructed by #(...).
- For example:
-
- #token COMMA ","
- #token STMT_LIST
-
- stmt_list: stmt (COMMA stmt)* <<#0=#(#[STMT_LIST],#0);>>
-
- The automatically constructed tree pointed to by #0 is just put at the
- end of the new list, so nothing is lost.
-
- If you reassign to #0 in the middle of the rule, automatic tree
- construction will result in the addition of remaining elements at the end
- of the new tree. This is not recommended by TJP.
-
- Special care must be used when combining the make-as-root operator
- (e.g. rule: expr OP^ expr) with this transgression (assignment to #0 when
- automatic tree construction is selected).
-
-
- 80. Even when automatic construction of ASTs is turned off in a rule the
- called rules still return the ASTs that they constructed. The same applies
- when the "!" operator is applied to a called rule. This is hard to
- believe when one sees a rule like the following:
-
- rule: a! b! c!
-
- generate (in part) a sequence of operations like:
-
- _ast = NULL; a(&_ast);
- _ast = NULL; b(&_ast);
- _ast = NULL; c(&_ast);
-
- It appears that the AST pointer is being assigned to a temporary where it
- becomes inaccessible. This is not the case at all. The called rule is
- responsible for placing a pointer to the AST which is constructed onto a
- stack of AST pointers. The stack of AST pointers is normally in global
- scope with ZZAST_STACKSIZE elements.
-
- (The "!" operator simply inhibits the automatic construction of the
- AST trees. It does not prevent the construction of the ASTs themselves.
- When calling a rule which constructs ASTs and not using the result one
- must destroy the constructed AST using zzfree_ast() in order to avoid a
- memory leak. See Example 6 below for code which aids in tracking lost
- ASTs).
-
- Consider the following examples (using the list notation of page 45 of
- the 1.00 manual):
-
- a: A;
- b: B;
- c: C;
-
- #token T_abc_node
-
- rule : a b c ; <<;>> /* AST list (0 A B C) without root */
- rule ! : a b c <<#0=#(0,#1,#2,#3);>> /* AST list (0 A B C) without root */
- rule : a! b! c! <<#0=#(0,#1,#2,#3);>> /* AST list (0 A B C) without root */
- rule : a^ b c /* AST tree (A B C) with root A */
- rule ! : a b c <<#0=#(#1,#2,#3);>> /* AST tree (A B C) with root A */
-
- rule ! : a b c <<#0=#(#[T_abc_node,0],#1,#2,#3);>>
- /* AST tree (T_abc_node A B C) */
- /* with root T_abc_node */
- rule : a b c <<#0=#(#[T_abc_node,0],#0);>> /* the same as above */
- rule : a! b! c! <<#0=#(#[T_abc_node,0],#1,#2,#3);>> /* the same as above */
-
- rule ! : a b c <<#0=#(toAST(T_abc_node),#1,#2,#3);>> /* the same as above */
- rule : a b c <<#0=#(toAST(T_abc_node),#0);>> /* the same as above */
- rule : a! b! c! <<#0=#(toAST(T_abc_node),#1,#2,#3);>> /* the same as above */
-
- The routine "toAST()" calls zzmk_ast() to construct an AST given the token number.
- For a typical version of zzmk_ast() it would look something like the following:
-
- AST * toAST (int tokenID) {
- return zzmk_ast (zzastnew(),tokenID,NULL);
- }
-
- I find toAST() more convenient than passing the extra arguments to zzmk_ast()
- using a construct like #[T_abc_node,0] or writing zzmk_ast() with varargs.
- Using varargs defeats most forms of inter-procedural type checking (unless you
- are using C++ which allows overloaded function names).
-
-
- 81. If you use ASTs you have to pass a root AST to ANTLR.
-
- AST *root=NULL;
- again:
- ANTLR (start(&root),stdin);
- walk_the_tree(root);
- zzfree_ast(root);
- root=NULL;
- goto again;
-
- 82. zzfree_ast(AST *tree) will recursively descend the AST tree and free
- all sub-trees. The user should supply a routine zzd_ast() to free any
- resources used by a single node - such as pointers to character strings
- allocated on the heap. See Example 2 on associativity and precedence.
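-
- The traversal can be sketched in a few lines of C (a child/sibling
- node and a counting del_node() stand in for the real AST and
- zzd_ast(); this shows the shape of the recursion, not the pccts
- source):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct ast { struct ast *down, *right; } ASTNode;

static int freed;                       /* counts del_node() calls */
static void del_node(ASTNode *n) { (void)n; freed++; }

/* recursive descent in the spirit of zzfree_ast(): free the
   children, then the siblings, then the node itself */
static void free_tree(ASTNode *t)
{
    if (t == NULL) return;
    free_tree(t->down);
    free_tree(t->right);
    del_node(t);                        /* per-node cleanup hook */
    free(t);
}
```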
-
- 83. AST elements in rules are assigned numbers in the same fashion as
- attributes with three exceptions:
-
- 1. A hole is left in the sequence when sub-rules are encountered.
- (e.g. "(...)+", "(...)*", and "{...}").
- 2. #0 is the AST of the named rule, not the sub-rule - see the next item
- 3. There is nothing analogous to $i.j notation (which allows one
- to refer to attributes from earlier in the rule). In other words,
- you can't use #i.j notation to refer to an AST created earlier
- in the rule.
-
- There are plans to add notation similar to Sorcerer's
- tagged elements, but this will not appear for a while.
- (24-Aug-94). It is not part of version 1.22.
-
- Consider the following example:
-
- a : b // B is #1 for the rule
- (c d)* // C is #1 when scope is inside the sub-rule
- // D is #2 when scope is inside the sub-rule
- // You may *NOT* refer to b as #1.1
- e // E is #3 for the rule
- // There is NO #2 for the rule
-
- 84. The expression #0 refers to the AST of the named rule. Thus it is
- a misnomer and (for consistency) should probably have been named ## or #$.
- There is nothing equivalent to $0 for ASTs. This is probably because
- sub-rules aren't assigned AST numbers in a rule. See above.
-
-
- 85. Associativity and precedence of operations is determined by nesting
- of rules. In the example below "=" associates to the right and has the
- lowest precedence. Operators "+" and "*" associate to the left with "*"
- having the highest precedence.
-
- expr0 : expr1 {"=" expr0};
- expr1 : expr2 ("\+" expr2)*;
- expr2 : expr3 ("\*" expr3)*;
- expr3 : ID;
-
- In Example 2 the zzpre_ast() routine is used to walk all the AST nodes.
- The AST nodes are numbered during creation so that one can see the order in
- which they are created and the order in which they are deleted. Do not
- confuse the "#" in the sample output with the AST numbers used to refer to
- elements of a rule in the action part of the rule. The "#" in the
- sample output are just to make it simpler to match elements of the
- expression tree with the order in which zzd_ast() is called for each node in
- the tree.
-
- 86. If the make-as-root operator were NOT used in the rules:
-
- ;expr0 : expr1 {"=" expr0}
- ;expr1 : expr2 ("\+" expr2)*
- ;expr2 : expr3 ("\*" expr3)*
- ;expr3 : ID
-
- With input:
-
- a+b*c
-
- The output would be:
-
- a <#1> \+ <#2> b <#3> \* <#4> c <#5> NEWLINE <#6>
-
- zzd_ast called for <node #6>
- zzd_ast called for <node #5>
- zzd_ast called for <node #4>
- zzd_ast called for <node #3>
- zzd_ast called for <node #2>
- zzd_ast called for <node #1>
-
-
- 87. Suppose that one wanted to replace the terminal "+" with the rule:
-
- addOp : "\+" | "-" ;
-
- Then one would be unable to use the "make-as-root" operator because it can
- be applied only to terminals.
-
- There are two workarounds. The #tokclass feature allows one to write:
-
- #tokclass AddOp { "\+" "\-"}
-
- A #tokclass identifier may be used in a rule wherever a simple #token
- identifier may be used.
-
- The other workaround is much more complicated:
-
- expr : (expr0 NEWLINE)
- ;expr0 : expr1 {"="^ expr0}
- ;expr1! : expr2 <<#0=#1;>>
- (addOp expr2 <<#0=#(#1,#0,#2);>> )*
- ;expr2 : expr3 ("\*"^ expr3)*
- ;expr3 : ID
- ;addOp : "\+" | "\-"
-
- With input:
-
- a-b-c
-
- The output is:
-
- ( \- <#4> ( \- <#2> a <#1> b <#3> ) c <#5> ) NEWLINE <#6>
-
- The "!" for rule "expr1" disables automatic constructions of ASTs in the
- rule. This allows one to manipulate #0 manually. If the expression had
- no addition operator then the sub-rule "(addOp expr2)*" would not be
- executed and #0 will be assigned the AST constructed by rule expr2 (i.e.
- AST #1). However if there is an addOp present then each time the sub-rule
- is rescanned due to the "(...)*" the current tree in #0 is placed as the
- first of two siblings underneath a new tree. This new tree has the AST
- returned by addOp (AST #1 of the addOp sub-rule) as the root.
-
- 88. There is an option for doubly linked ASTs in the module ast.c. It is
- controlled by #define zzAST_DOUBLE. Even with zzAST_DOUBLE only the right
- and down fields are filled while the AST tree is constructed. Once the tree
- is constructed the user must call the routine zzdouble_link(tree,NULL,NULL) to
- traverse the tree and fill in the left and up fields. See page 12 of the
- 1.06 manual for more information.
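-
- What zzdouble_link(tree,NULL,NULL) has to do can be sketched as
- follows (the field and argument roles are assumed from the description
- above, not copied from ast.c):

```c
#include <assert.h>
#include <stddef.h>

typedef struct ast {
    struct ast *down, *right;     /* filled in during construction */
    struct ast *up, *left;        /* filled in by the second pass  */
} ASTNode;

/* walk the finished tree and fill in the back links */
static void double_link(ASTNode *t, ASTNode *left, ASTNode *up)
{
    if (t == NULL) return;
    t->left = left;
    t->up   = up;
    double_link(t->down, NULL, t);  /* first child: no left sibling */
    double_link(t->right, t, up);   /* sibling keeps the same parent */
}
```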
-
- 89. If a rule which creates an AST is called and the result is not
- linked into the tree being constructed then zzd_ast() will not be called
- to release the resources used by the rule. Prior to version 1.20
- this was especially important when rules were used in syntactic predicates.
- Versions >= 1.20 bypass construction of all ASTs during guess mode.
-
- ===============================================================================
- Section on Semantic Predicates
- -------------------------------------------------------------------------------
- 90. There is a bug in 1.1x and 1.2x which prevents semantic predicates
- from including string literals. The predicate is incorrectly
- "string-ized" in the call to zzfailed_predicate.
-
- rule: <<containsCharacter("!@#$%^&*",LATEXT(1))>>? ID
- /* Will not work */
-
- The workaround is to place the literal in a string constant and use
- the variable name.
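-
- For example (containsCharacter() is the hypothetical helper from the
- fragment above; one plausible definition is shown here only so the
- workaround compiles):

```c
#include <assert.h>
#include <string.h>

/* one plausible definition of the helper used in the predicate:
   true if the first character of text is in the given set */
static int containsCharacter(const char *set, const char *text)
{
    return text != NULL && text[0] != '\0' && strchr(set, text[0]) != NULL;
}

/* workaround: keep the literal in a named constant so the predicate
   text itself contains no string literal:
       <<containsCharacter(specialChars,LATEXT(1))>>? ID          */
static const char specialChars[] = "!@#$%^&*";
```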
-
- 91. There is a bug in 1.1x and 1.2x which prevents semantic predicates from
- crossing lines unless one uses an escaped newline.
-
- rule: <<do_test(\ /*** Note escaped newline ***/
- this_works_in_120)>>? x y z;
-
- 92. Semantic predicates are enclosed in "<<... >>?" but because they are
- inside "if" statements they normally do not end with a ";" - unlike other
- code enclosed in "<<...>>" in ANTLR.
-
- 93. Semantic predicates which are not the first element in the rule or
- sub-rule become "validation predicates" and are not used for prediction.
- After all, if there are no alternatives, then there is no need for
- prediction - and alternatives exist only at the left edge of rules
- and sub-rules. Even if a semantic predicate is on the left edge there
- is no guarantee that it will be part of the prediction expression.
- Consider the following two examples:
-
- a : << LA(1)==ID ? propX(LATEXT(1)) : 1 >>? ID glob /* a1 */
- | ID glob /* a2 */
- ;
- b : << LA(1)==ID ? propX(LATEXT(1)) : 1 >>? ID glob /* b1 */
- | NUMBER glob /* b2 */
- ;
-
- Rule a requires the semantic predicate to disambiguate alternatives
- a1 and a2 because the rules are otherwise identical. Rule b has a
- token type of NUMBER in alternative b2 so it can be distinguished from
- b1 without evaluation of the semantic predicate during prediction. In
- both cases the semantic predicate will also be evaluated inside the rule.
-
- When the tokens which can follow a rule allow ANTLR to disambiguate the
- expression without resort to semantic predicates ANTLR may not evaluate
- the semantic predicate in the prediction code. For example:
-
- simple_func : <<LA(1)==ID ? isSimpleFunc(LATEXT(1)) : 1>>? ID
- complex_func : <<LA(1)==ID ? isComplexFunc(LATEXT(1)) : 1>>? ID
-
- function_call : "(" ")"
-
- func : simple_func function_call
- | complex_func "." ID function_call
-
- In this case, a "simple_func" MUST be followed by a "(", and a
- "complex_func" MUST be followed by a ".", so it is unnecessary to evaluate
- the semantic predicates in order to predict which of the alternatives to
- use. A simple test of the lookahead tokens is sufficient. As stated
- before, the semantic predicates will still be used to validate the rule.
-
-
- 94. Suppose that the requirement that all semantic predicates used in
- prediction expressions must appear at the left-hand edge of a rule
- were lifted. Consider the following code segment:
-
- cast_expr /* a1 */
- : LP typedef RP cast_expr /* a2 */
- | expr13 /* a3 */
- ;expr13 /* a4 */
- : id_name /* a5 */
- | LP cast_expr RP /* a6 */
- ;typedef /* a7 */
- : <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1 >>? ID /* a8 */
- ;id_name /* a9 */
- : ID /* a10 */
-
- Now consider the token sequences:
-
- Token: #1 #2 #3 #4
- -- ----------------------- -- --
- "(" ID-which-is-typedef ")" ID
- "(" ID-which-is-NOT-typedef ")"
-
- Were the semantic predicate at line a8 hoisted to predict which alternative
- of cast_expr to use (a2 or a3) the program would use the wrong lookahead
- token (LA(1) and LATEXT(1)) rather than LA(2) and LATEXT(2) to check for an
- ID which satisfies "isTypedefName()". This is because it is preceded by a
- "(". This problem could perhaps be solved by application of sufficient
- ingenuity; in the meantime, the solution is to rewrite the rules so as
- to move the decision point to the left edge of the production.
-
- First perform in-line expansion of expr13 (line a3) in cast_expr:
-
- cast_expr /* b1 */
- : LP typedef RP cast_expr /* b2 */
- | id_name /* b3 */
- | LP cast_expr RP /* b4 */
-
- Secondly, move the alternatives (in cast_expr) beginning with LP to a
- separate rule so that "typedef" and "cast_expr" will be on the left edge:
-
- cast_expr /* c1 */
- : LP cast_expr_suffix /* c2 */
- | id_name /* c3 */
- ;cast_expr_suffix /* c4 */
- : typedef RP cast_expr /* c5 */
- | cast_expr RP /* c6 */
- ;typedef /* c7 */
- : <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1 >>? ID /* c8 */
- ;id_name /* c9 */
- : ID /* c10 */
-
- This will result in the desired treatment of the semantic predicate to
- choose from alternatives c5 and c6.
-
-
- 95. Validation predicates are evaluated by the parser. If they fail a
- call to zzfailed_predicate(string) is made. To disable the message
- redefine the macro zzfailed_predicate(string) or use the optional
- "failed predicate" action which is enclosed in "[" and "]" and follows
- immediately after the predicate:
-
- a : <<LA(1)==ID ?
- isTypedef(LATEXT(1)) : 1>>?[printf("Not a typedef\n");]
-
- 96. An expression in a semantic predicate (e.g. <<isFunc()>>? ) should not
- have side-effects. If there is no match then the rest of the rule using the
- semantic predicate won't be executed.
-
-
- 97. What is the "context" of a semantic predicate ? Answer due to TJP:
-
- The context of a predicate is the set of k-strings (comprised of lookahead
- symbols) that can be matched following the execution of a predicate. For
- example,
-
- a : <<p>>? alpha ;
-
- The context of "p" is LOOK(alpha) where LOOK(alpha) is the set of
- lookahead k-strings for alpha.
-
- Normally, one should compute the context for ANTLR (manually) because
- ANTLR is not smart enough to know the nature of your predicate and does not
- know how much context information is needed; it's conservative and tries
- to compute full LL(k) lookahead. Normally, you only need one token:
-
- class_name: <<isClass(LATEXT(1))>>? ID ;
-
- This example is incomplete, the predicate should really be:
-
- class_name: <<LA(1)==ID ? isClass(LATEXT(1)) : 1>>? ID ;
-
- This says, "I can tell you something if you have an ID, otherwise
- just assume that the rule is semantically valid." This only makes a
- difference if the predicate is *hoisted* out of the rule. Here is an
- example that won't work because it doesn't have context checks in the
- predicates:
-
- a : ( class_name | NUM )
- | type_name
- ;
-
- class_name : <<isClass(LATEXT(1))>>? ID ;
-
- type_name : <<isType(LATEXT(1))>>? ID ;
-
- The prediction for production one of rule "a" will be:
-
- if ( LA(1) in { ID, NUM } && isClass(LATEXT(1)) ) { ...
-
- Clearly, NUM will never satisfy isClass(), so the production will never
- match.
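-
- The effect can be modeled in plain C. The token codes and the
- isClass() stand-in are invented for the sketch (the real predicate
- inspects LATEXT(1), not the token code):

```c
enum { ID = 1, NUM = 2 };

/* Pretend symbol-table query: in real code this would inspect
 * LATEXT(1); here only an ID can name a class. */
static int isClass(int la1) { return la1 == ID; }

/* Prediction test as generated from the unguarded predicate:
 * isClass() is consulted even when the lookahead token is a NUM,
 * so the NUM alternative of production one can never be chosen. */
static int predict_unguarded(int la1)
{
    return (la1 == ID || la1 == NUM) && isClass(la1);
}

/* With the LA(1)==ID guard the predicate reports "valid" for
 * non-ID tokens, so a NUM can still select production one. */
static int predict_guarded(int la1)
{
    return (la1 == ID || la1 == NUM) && (la1 == ID ? isClass(la1) : 1);
}
```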
-
- When you ask ANTLR to compute context, it can check for missing predicates.
- With -prc on, for this grammar:
-
- a : b
- | <<isVar(LATEXT(1))>>? ID
- | <<isPositive(LATEXT(1))>>? NUM
- ;
-
- b : <<isType(LATEXT(1))>>? ID
- | NUM
- ;
-
- ANTLR reports:
-
- warning alt 1 of rule itself has no predicate to resolve
- ambiguity upon \{ NUM \}
-
- Page 34
-
- 98. A documented restriction of ANTLR is the inability to hoist multiple
- semantic predicates. However, no error message is given when one attempts
- this. When compiled with k=1 and ck=2 this generates inappropriate code
- in "statement" when attempting to predict "expr":
-
- #header <<
-
- #include "charbuf.h"
-
- int istypedefName (char *);
- int isCommand (char *);
-
- >>
-
- #token BARK
- #token GROWL
- #token ID
-
- statement
- : expr
- | declaration
- ;expr
- : commandName BARK
- | typedefName GROWL
- ;declaration
- : typedefName BARK
- ;typedefName
- : <<LA(1)==ID ? istypedefName(LATEXT(1)) : 1>>? ID
- ;commandName
- : <<LA(1)==ID ? isCommand(LATEXT(1)) : 1>>? ID
- ;
-
- The generated code resembles the following:
-
- void statement()
- {
- if ( (LA(1)==ID) &&
- (LA(2)==BARK || LA(2)==GROWL) &&
- ( (LA(1)==ID ? isCommand(LATEXT(1)) : 1) ||
- (LA(1)==ID ? istypedefName(LATEXT(1)) : 1)) ) {
- expr();
- } else {
- if ( (LA(1)==ID) &&
- (LA(2)==BARK) &&
-                  (LA(1)==ID ? istypedefName(LATEXT(1)) : 1) ) {
- declaration();
- } ...
-
- The problem is that "<typedefName> BARK" will be passed to expr() rather
- than declaration().
-
- Some help is obtained by using leading actions to inhibit hoisting as
- described in the next notes. However, omitting all semantic predicates
- in the prediction expression doesn't help if one requires them to predict
- the rule.
-
- Page 35
-
- 99. Leading actions will inhibit the hoisting of semantic predicates into
- the prediction of rules.
-
- expr_rhs
- : <<;>> <<>> expr0
- | command
-
- See the section on known bugs for a more complete example.
-
- 100. When using semantic predicates in ANTLR it is *IMPORTANT* to
- understand what the "-prc on" ("predicate context computation")
- option does and what "-prc off" doesn't do. Consider the following
- example:
-
- +------------------------------------------------------+
- | Note: All examples in this sub-section are based on |
- | code generated with -k=1 and -ck=1. |
- +------------------------------------------------------+
-
- expr : upper
- | lower
- | number
- ;
-
- upper : <<isU(LATEXT(1))>>? ID ;
- lower : <<isL(LATEXT(1))>>? ID ;
- number : NUMBER ;
-
- With "-prc on" ("-prc off" is the default) the code for expr() to predict
- upper() would resemble:
-
- if (LA(1)==ID && isU(LATEXT(1)) && LA(1)==ID) { /* a1 */
- upper(zzSTR); /* a2 */
- } /* a3 */
- else { /* a4 */
- if (LA(1)==ID && isL(LATEXT(1)) && LA(1)==ID) { /* a5 */
- lower(zzSTR); /* a6 */
- } /* a7 */
- else { /* a8 */
- if (LA(1)==NUMBER) { /* a9 */
- zzmatch(NUMBER); /* a10 */
- } /* a11 */
- else /* a12 */
- {zzFAIL();goto fail;} /* a13 */
- } /* a14 */
- } ...
- ...
-
- *******************************************************
- *** ***
- *** Starting with version 1.20: ***
- *** Predicate tests appear AFTER lookahead tests ***
- *** ***
- *******************************************************
-
- Note that each test of LATEXT(i) is guarded by a test of the token type
- (e.g. "LA(1)==ID && isU(LATEXT(1))").
- Page 36
-
- With "-prc off" the code would resemble:
-
- if (isU(LATEXT(1)) && LA(1)==ID) { /* b1 */
- upper(zzSTR); /* b2 */
- } /* b3 */
- else { /* b4 */
- if (isL(LATEXT(1)) && LA(1)==ID) { /* b5 */
- lower(zzSTR); /* b6 */
- } /* b7 */
- else { /* b8 */
- if ( (LA(1)==NUMBER) ) { /* b9 */
- zzmatch(NUMBER); /* b10 */
- } /* b11 */
- else /* b12 */
- {zzFAIL();goto fail;} /* b13 */
- } /* b14 */
- } ...
- ...
-
- Thus when coding the grammar for use with "-prc off" it is necessary
- to do something like:
-
- upper : <<LA(1)==ID && isU(LATEXT(1))>>? ID ;
- lower : <<LA(1)==ID && isL(LATEXT(1))>>? ID ;
-
- This makes sure that a token of type NUMBER is not passed to isU()
- or isL() when using "-prc off".
-
- So, you say to yourself, "-prc on" is good and "-prc off" is bad. Wrong.
-
- Consider the following slightly more complicated example in which the
- first alternative of rule "expr" contains tokens of two different types:
-
- expr : ( upper | NUMBER ) NUMBER
- | lower
- | ID
- ;
-
- upper : <<LA(1)==ID && isU(LATEXT(1))>>? ID ;
- lower : <<LA(1)==ID && isL(LATEXT(1))>>? ID ;
- number : NUMBER ;
-
- With "-prc off" the code would resemble:
-
- ...
- { /* c1 */
- if (LA(1)==ID && isU(LATEXT(1)) && /* c2 */
- ( LA(1)==ID || LA(1)==NUMBER) ) { /* c3 */
- { /* c4 */
- if (LA(1)==ID) { /* c5 */
- upper(zzSTR); /* c6 */
- } /* c7 */
- else { /* c8 */
- if (LA(1)==NUMBER) { /* c9 */
- zzmatch(NUMBER); /* c10 */
- } /* c11 */
- else {zzFAIL();goto fail;}/* c12 */
- } /* c13 */
- } ...
- ...
- Page 37
-
- Note that if the token is a NUMBER (i.e. LA(1)==NUMBER) then the clause at
- line c2 ("LA(1)==ID && ...") will always be false, which implies that the
- test in the "if" statement (lines c2/c3) will always be false. (In other
- words LA(1)==NUMBER implies LA(1)!=ID). Thus the sub-rule for NUMBER at
- line c9 can never be reached.
-
- With "-prc on" essentially the same code is generated, although it
- is not necessary to manually code a test for token type ID preceding
- the call to "isU()".
-
- The workaround is to bypass the heart of the predicate when
- testing the wrong type of token.
-
- upper : <<LA(1)==ID ? isU(LATEXT(1)) : 1>>? ID ;
- lower : <<LA(1)==ID ? isL(LATEXT(1)) : 1>>? ID ;
-
- Then with "-prc off" the code would resemble:
- ...
- { /* d1 */
- if ( (LA(1)==ID ? isU(LATEXT(1)) : 1) && /* d2 */
- (LA(1)==ID || LA(1)==NUMBER) ) { /* d3 */
- ...
- ...
-
- With this correction the body of the "if" statement is now reachable
- even if the token type is NUMBER - the "if" statement does what one
- wants.
-
- With "-prc on" the code would resemble:
-
- ... /* e1 */
- if (LA(1)==ID && /* e2 */
- (LA(1)==ID ? isU(LATEXT(1)) : 1) && /* e3 */
- (LA(1)==ID || LA(1)==NUMBER) ) { /* e4 */
- ...
- ...
-
- Note that the problem of the unreachable "if" statement body has
- reappeared because of the redundant test ("e2") added by the predicate
- computation.
-
- The lesson seems to be: when a rule has alternatives that are "visible"
- to ANTLR (within the lookahead distance) and that begin with different
- token types, it is probably dangerous to use "-prc on".
-
- Page 38
-
- 101. You cannot use downward inheritance to pass parameters
- to semantic predicates which are NOT validation predicates. The
- problem appears when the semantic predicate is hoisted into a
- parent rule to predict which rule to call:
-
- For instance:
-
- a : b1 [flag]
- | b2
- | b3
-
- b1 [int flag]
- : <<LA(1)==ID && flag && hasPropertyABC (LATEXT(1))>>? ID ;
-
- b2
- : <<LA(1)==ID && hasPropertyXYZ (LATEXT(1))>>? ID ;
-
- b3 : ID ;
-
- When the semantic predicate is evaluated within rule "a" to determine
- whether to call b1, b2, or b3 the compiler will discover that there
- is no variable named "flag" for procedure "a()". If you are unlucky
- enough to have a variable named "flag" in a() then you will have a
- VERY difficult-to-find bug.
-
- The -prc option has no effect on this behavior.
-
- It is possible that an init-action will inhibit the hoisting of the
- predicate and make this code work. I have not verified this with
- versions 1.2x.
-
- 102. Another reason why semantic predicates must not have side effects is
- that when they are hoisted into a parent rule in order to decide which
- rule to call they will be invoked twice: once as part of the prediction
- and a second time as part of the validation of the rule.
-
- Consider the example above of upper and lower. When the input does
- in fact match "upper" the routine isU() will be called twice: once inside
- expr() to help predict which rule to call, and a second time in upper() to
- validate the prediction. If the second test fails the macro zzpred_fail()
- is called.
-
- As far as I can tell, there is no simple way to disable the use of a
- semantic predicate for validation after it has been used for prediction.
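-
- The double evaluation is easy to demonstrate with a toy model in C
- (the names are invented; this is not generated code):

```c
/* Call counter makes the double evaluation visible. */
static int isU_calls = 0;

static int isU(void)
{
    isU_calls++;
    return 1;          /* pretend the text really is upper case */
}

/* Toy model of expr()/upper(): the hoisted copy of the predicate
 * runs during prediction, then the original runs again inside the
 * chosen rule as validation. */
static int match_upper(void)
{
    if (isU()) {       /* prediction, hoisted into expr() */
        return isU();  /* validation, inside upper() */
    }
    return 0;
}
```

- A predicate that, say, incremented a counter or entered a symbol
- into a table would therefore do so twice per successful match.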
-
- Page 39
-
- 103. I had a problem in which I needed to do a limited amount of
- lookahead, but didn't want to use all the machinery of syntactic
- predicates. I found that I could enlarge the set of expressions accepted
- by "expr" and then look at the AST created in order to determine what
- rules could follow:
-
- cast_expr /* a1 */
- : <<int isCast=0;>> /* a2 */
- /* a3 */
- LP! predefined_type RP! cast_expr /* a4 */
- <<#0=#(toAST(T_cast),#0);>> /* a5 */
- | LP! expr0 RP! /* a6 */
- <<if ((#2->token)==T_class_name) { /* a7 */
- isCast=1; /* a8 */
- } else { /* a9 */
- isCast=0; /* a10 */
- }; /* a11 */
- >> /* a12 */
- ( <<;>> <<isCast==1>>? /* a13 */
- <<printf ("\nIs cast expr\n");>> /* a14 */
- cast_expr /* a15 */
- <<#0=#(toAST(T_cast),#0);>> /* a16 */
- /* a17 */
- | <<printf ("\nIs NOT cast expr\n");>> /* a18 */
- () /* empty */ /* a19 */
- ) /* a20 */
- | unary_expr /* a21 */
-
- Later on I gave up on this approach and decided to use syntactic
- predicates anyway. It not only solved this problem, but others
- where it was more difficult to patch up the grammar. I can't bring
- myself to remove the example, though.
- Page 40
-
- ===============================================================================
- Section on Syntactic Predicates (also known as "Guess Mode")
- -------------------------------------------------------------------------------
- 104. The terms "infinite lookahead", "guess mode", and "syntactic
- predicates" are all equivalent. Sometimes the term "backtracking" is
- used as well, although "backtracking" can also be used to discuss
- lexing and DLG. The term "syntactic predicate" emphasizes that it is
- handled by the parser. The term "guess mode" emphasizes that the parser
- may have to backtrack. The term "infinite lookahead" emphasizes the
- implementation in ANTLR: the entire input is read, processed, and
- tokenized by DLG before ANTLR begins parsing.
-
- 105. An expression in a syntactic predicate should not have side-effects.
- If there is no match then the rule using the syntactic predicate won't be
- executed.
-
- 106. When using syntactic predicates the entire input buffer is read
- and tokenized by DLG before parsing by ANTLR begins. If a "wrong"
- guess requires that parsing be rewound to an earlier point, all
- attributes that were created during the "guess" are destroyed, and
- the parsing begins again, creating new attributes as it reparses the
- (previously) tokenized input.
-
- 107. In infinite lookahead mode the line and column information is
- hopelessly out-of-sync because zzline will contain the line number of
- the last line of input - the entire input was scanned before
- parsing began. The line and column information is not restored
- during backtracking. To keep track of the line information in a meaningful
- way one has to use the ZZINF_LINE macro which was added to pccts in version
- 1.20.
-
- Putting line and column information in a field of the attribute will not
- help. The attributes are created by ANTLR, not DLG, and when ANTLR
- backtracks it destroys any attributes that were created in making the
- incorrect guess.
-
- 108. As infinite lookahead mode causes the entire input to be scanned
- by DLG before ANTLR begins parsing, one cannot depend on feedback from
- the parser to the lexer to handle things like providing special token codes
- for items which are in a symbol table (the "lex hack" for typedefs
- in the C language). Instead one MUST use semantic predicates which allow
- for such decisions to be made by the parser.
-
- 109. One cannot use an interactive scanner (ANTLR -gk option) with the
- ANTLR infinite lookahead and backtracking options (syntactic predicates).
-
- 110. An example of the need for syntactic predicates is the case where
- relational expressions involving "<" and ">" are enclosed in angle bracket
- pairs.
-
- Relation: a < b
- Array Index: b <i>
- Problem: a < b<i>
- vs. b < a>
-
- I was going to make this into an extended example, but I haven't had
- time yet.
-
- 111. Version 1.20 fixes a problem in 1.10 in which ASTs were constructed
- during guess mode. In version 1.10 care had to be taken to deallocate the
- ASTs that were created in the rules which were invoked in guess mode.
-
- Page 41
-
- 112. The following is an example of the use of syntactic predicates.
-
- program : ( s SEMI )* ;
-
- s : ( ID EQUALS )? ID EQUALS e
- | e
- ;
-
- e : t ( PLUS t | MINUS t )* ;
-
- t : f ( TIMES f | DIV f )* ;
-
- f : Num
- | ID
- | "\(" e "\)"
- ;
-
- When compiled with k=1:
-
- antlr -fe err.c -fh stdpccts.h -fl parser.dlg -ft tokens.h \
- -fm mode.h -k 1 test.g
-
- One gets the following warning:
-
- warning: alts 1 and 2 of the rule itself ambiguous upon { ID }
-
- even though the manual suggests that this is okay. The only problem is
- that ANTLR 1.10 should NOT issue this error message unless the -w2 option
- is selected.
-
- Included with permission of S. Salters
- Page 42
-
- ===============================================================================
- Section on Inheritance
- -------------------------------------------------------------------------------
- 113. A rule which uses upward inheritance:
-
- rule > [int result] : x | y | z;
-
- is simply declaring a function which returns an "int" as a function
- value. If the function has more than one item passed via upward
- inheritance then ANTLR creates a structure to hold the result and
- then copies each component of the structure to the upward inheritance
- variables.
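-
- The multiple-result case can be sketched in plain C (the struct and
- field names are invented; ANTLR's generated names differ):

```c
/* Analogue of: rule > [int count, int sum] : ... ; */
typedef struct {
    int count;    /* first upward-inheritance result  */
    int sum;      /* second upward-inheritance result */
} rule_rv;

static rule_rv rule(void)
{
    rule_rv rv;
    rv.count = 3;   /* stands in for actions like <<$count = ...;>> */
    rv.sum   = 10;
    return rv;      /* the caller copies each field out */
}
```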
-
- 114. When writing a rule that uses downward inheritance:
-
- rule [int *x] : r1 r2 r3
-
- one should remember that the arguments passed via downward inheritance are
- simply arguments to a function. If one is using downward inheritance
- syntax to pass results back to the caller (really upward inheritance!)
- then it is necessary to pass the address of the variable which will receive
- the result.
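-
- The caller side can be sketched in plain C (the names are invented):

```c
/* Analogue of: rule [int *x] : r1 r2 r3 ;  The "inherited"
 * parameter is just an ordinary function argument. */
static void rule(int *x)
{
    *x = 42;    /* stands in for an action like <<*x = value;>> */
}

/* The caller passes an address to receive the result, matching a
 * grammar invocation such as rule[&result]. */
static int caller(void)
{
    int result = 0;
    rule(&result);
    return result;
}
```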
-
- 115. ANTLR is smart enough to combine the declaration for an AST with
- the items declared via downward inheritance when constructing the
- prototype for a function which uses both ASTs and downward inheritance.
- Page 43
-
- ===============================================================================
- Section on LA, LATEXT, NLA, and NLATEXT
- -------------------------------------------------------------------------------
- 116. Need examples of LATEXT for various forms of lookahead.
-
- ===============================================================================
- Section on Prototypes
- -------------------------------------------------------------------------------
- 117. Prototype for typical create_attr routine:
-
- #define zzcr_attr(attr,token,text) \
- create_attr(attr,token,text)
-
- void create_attr (Attrib *attr,int token,char *text);
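-
- A typical body for such a routine might look like this (the Attrib
- layout is invented for the sketch; a real project defines its own):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical attribute layout; real projects define their own,
 * usually in the #header section before the generated files. */
typedef struct {
    int   token;
    char *text;
} Attrib;

/* Record the token code and duplicate the scanned text, since the
 * DLG buffer it points into is overwritten as later tokens are
 * scanned. */
static void create_attr(Attrib *attr, int token, char *text)
{
    attr->token = token;
    attr->text  = malloc(strlen(text) + 1);
    if (attr->text != NULL)
        strcpy(attr->text, text);
}
```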
-
- 118. Prototype for a typical create_ast routine invoked to automatically
- construct an AST from an attribute:
-
-          #define zzcr_ast(ast,attr,tok,text) \
-                create_ast(ast,attr,tok,text)
-
-          void create_ast (AST *ast,Attrib *attr,int tok,char *text);
-
- 119. Prototype for a typical make_ast routine invoked by the #[...]
- notation.
-
- AST *zzmk_ast (AST *ast,int token,char *text)
-
- 120. Prototype for a typical zzd_ast macro which is invoked when destroying
- an AST node:
-
- #define zzd_ast(node) delete_ast(node)
-
- void delete_ast (AST * node);
-
- 121. Prototype for zzdef0 macro to initialize $0 of a rule:
-
- #define zzdef0(attr) define_attr_0 (attr)
-
- void define_attr_0 (Attrib *attr);
-
- 122. Prototype for ANTLR (these are actually macros):
-
- read from file: void ANTLR (void startRule(...),FILE *)
- read from string: void ANTLRs (void startRule(...),zzchar_t *)
- read from function: void ANTLRf (void startRule(...),int (*)())
- read from file: void ANTLRm
- (void startRule(...),FILE *,int lexclass)
-
- In the call to ANTLRf the function behaves like getchar()
- in that it returns EOF (-1) to indicate end-of-file.
-
- If ASTs are used or there is downward or upward inheritance then the
- call to the startRule must pass these arguments:
-
- AST *root;
-          ANTLR (startRule(&root),stdin);
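-
- A getchar()-style input function suitable for ANTLRf can be sketched
- as follows (the fixed input string is just for illustration):

```c
#include <stdio.h>   /* for EOF */

/* Input held in a fixed string for the sketch; a real callback
 * might read from a socket, a pipe, or an editor buffer. */
static const char *input_text = "a = b;";
static int input_pos = 0;

/* Behaves like getchar(): returns successive characters of the
 * input, then EOF (-1) once the input is exhausted. */
static int next_char(void)
{
    if (input_text[input_pos] == '\0')
        return EOF;
    return (unsigned char)input_text[input_pos++];
}
```

- It would then be installed with a call of the form
- ANTLRf (startRule(), next_char); adding the AST or inheritance
- arguments to the start rule as described above when needed.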
- Page 44
-
- ===============================================================================
- Section on ANTLR/DLG Internals Routines That Might Be Useful
- -------------------------------------------------------------------------------
- ****************************
- ****************************
- ** **
- ** Use at your own risk **
- ** **
- ****************************
- ****************************
-
- 123. static int zzauto - defined in dlgauto.h
-
- Current DLG mode. This is used by zzmode() only.
-
- 124. void zzerr (char * s) defined in dlgauto.h
-
- Defaults to zzerrstd(char *s) in dlgauto.h
-
- Unless replaced by a user-written error reporting routine:
-
- fprintf(stderr,
- "%s near line %d (text was '%s')\n",
- ((s == NULL) ? "Lexical error" : s),
- zzline,zzlextext);
-
- 125. static char zzebuf[70] defined in dlgauto.h
- Page 45
-
- ===============================================================================
- Section on Known Minor Bugs in pccts
- -------------------------------------------------------------------------------
- 126. In the 8-Aug-94 distribution of version 1.21, line 434 of err.h
- should be changed:
-
- from: zzline = zzinf_line[zzinf_line];
- to: zzline = zzinf_line[zzinf_labase];
-
- Diagnosed (and fix) reported by Anders Sterholm (anvil@lysator.liu.se)
-
- 127. With version 1.20 (1-Apr-94) of ANTLR the -w2 switch will sometimes
- cause items to be reported as lacking an associated regular expression when
- they actually do have an associated regular expression. Sometimes
- #lexclass names and #tokclass names will also be reported. This is
- reported as fixed in 1.21.
-
- 128. The UPDATE.120 of (1-Apr-94) reports that there are problems in
- combining guess mode and semantic predicates under some circumstances.
-
- 129. In version 1.20 there is a bug in the 1-Apr-94 distribution of
- routine _zzsetmatch() which appears when used with ANTLR -gk (demand
- lookahead). This is reported as fixed in version 1.21.
-
- 130. In version 1.20 a rule which returns more than two arguments
- (via upward inheritance) generates code which cannot be compiled by
- a strict K&R compiler. It is accepted by ANSI compilers, however.
- This is reported as fixed in version 1.21. Reported by Joachim Schrod
- (schrod@iti.informatik.th-darmstadt.de).
-
- 131. There appears to be a bug in 1.20 and 1.21 when using syntactic
- predicates. Sometimes attributes are corrupted by backtracking when k=1
- under a combination of circumstances which are difficult to describe. A
- patch for this problem is reported in netnews note #421 and the problem
- will be fixed in release 1.22. This problem was reported by Sanjay
- Ghemawat (Sanjay@lcs.mit.edu)
-
- Page 46
-
- 132. The following problem was first reported by Carle Patrice
- (carle@litp.ibp.fr) in netnews note #347. TJP describes the problem
- in netnews note #428:
-
- It is related to lookahead computation for k=3 with a hoisted
- predicate. I have reduced the error to this case:
-
- a : (A B | B C) | b ;
-
- b : <<pred-c>>? B B ;
-
- It did not compute the correct lookahead decision for
- alternative one in rule "a". It should generate:
-
- if ( (LA(1)==A || LA(1)==B) && (LA(2)==B || LA(2)==C) &&
- !((LA(1)==B&&(LA(2)==B))) ) {
- ...
- }
- else blah
-
- *NOT* simply:
-
- if ( (LA(1)==A || LA(1)==B) && (LA(2)==B || LA(2)==C) ) {
- ...
- }
- else blah
-
- This will be fixed in version 1.22.
-
- Page 47
-
- 133. In version 1.10 there was a bug in the hoisting of semantic
- predicates when all alternatives of a rule had the same lookahead token.
- I believe this has been fixed in versions >= 1.20. Consider the following
- (compiled with -prc off -k 1 -ck 3):
-
- ;obj_name
- : global_func_id
- | simple_class SUFFIX_DOT
- | ID
-
- ;simple_class
- : <<(LA(1)==ID ? isClass(LATEXT(1)) : 1)>>? ID
-
- ;global_func_id
- : <<(LA(1)==ID ? isFunction(LATEXT(1)) : 1)>>? ID
-
- In version 1.2x the generated code resembles:
-
- void obj_name(void) {
- if ( (LA(1)==ID) &&
- (LA(1)==ID ? isFunction(LATEXT(1)) : 1)) ) {
- global_func_id();
- } else {
- if ( (LA(1)==ID) &&
- (LA(2)==SUFFIX_DOT) ) {
- simple_class();
- } else {
- if ( (LA(1)==ID) ) {
- zzmatch(ID);
- ...
-
- The workaround, previously, was to precede semantic predicates with an
- action (such as "<<;>>") which inhibits hoisting into the prediction
- expression.
- Page 48
-
- #header
- <<
- #include "charptr.h"
- >>
-
- #token QUESTION
- #token COMMA
- #token ID
-
- #token Eof "@"
-
- #token SUFFIX_DOT
-
- statement
- : (information_request)?
- | obj_name
-
- ;information_request
- : QUESTION id_list ( QUESTION )*
- | id_list ( QUESTION )+
-
- ;id_list
- : obj_name ( COMMA | obj_name )*
-
- ;simple_class
- : <<(LA(1)==ID ? isClass(LATEXT(1)) : 1)>>? ID
-
- ;global_func_id
- : <<(LA(1)==ID ? isFunction(LATEXT(1)) : 1)>>? ID
-
- ;obj_name
- : <<;>> global_func_id
- | <<;>> simple_class SUFFIX_DOT
- | ID
- ;
- Page 49
-
- ===============================================================================
- Ideas on the Construction of ASTs and their use with Sorcerer
- -------------------------------------------------------------------------------
- Consider the problem of a grammar which would normally require two
- passes through the source code to properly analyze. In some cases
- it is convenient to perform a first pass which creates AST trees
- and perform the second pass by analyzing the AST trees with Sorcerer.
-
- 1) Define an AST node that contains the information you'll need in the
- second pass. For example,
-
- /*
- * Parse trees are represented by an abstract-syntax-tree (AST)
- * (forward declare the pointer here). Refer to parse.h for description
- * of parse_info.
- */
- typedef struct parse_struct *ast_ref;
-
- /* parser attributes ($-symbols) & AST nodes */
- typedef struct parse_struct *pinfo_ref;
-
- /*
- * the parse structure is used to describe both attributes and
- * AST nodes
- */
-
- struct parse_struct {
- pinfo_ref right; /* points to siblings */
- pinfo_ref down; /* points to children */
- int token; /* token number (see tokens.h) */
- char *text; /* input text */
- src_pos pos; /* position in source file */
- object_ref obj; /* object description (id's) */
- type_ref typ; /* type description (expr's) */
- const_value value; /* value of a constant expression */
- } ;
-
- /*
- * define Abstract Syntax Tree (AST) nodes
- */
-
- /* ast_ref was forward-defined */
- typedef struct parse_struct AST;
-
- /*
- * the Pass-1 (parse phase) parse-attributes ($-variables)
- * have the same structure as an AST node.
- */
- typedef struct parse_struct Attrib, *Attrib_ref;
-
-
- In the code above, the parse-attribute was defined to have the same
- structure as an AST node. This isn't a requirement, but just makes it
- easier to pass information produced in the first pass on to subsequent
- passes.
- Page 50
-
- 2) Have the first pass build a symbol table as it parses the input, perform
- semantic checks, and build an AST. Use the -gt (generate tree) option on
- ANTLR, and override the automatically generated tree construction operations
- as needed. For example,
-
- var_declare:
- << pvec_ref v_list;
- int i;
- boolean has_var_section = FALSE;
- >>
- VAR^
- (
- var_id_list > [v_list] COLON
- { extern_kw
- | static_kw
- }
- type
- <<
- for (i = 0; i < v_list->len; ++i) {
- object_ref v = (object_ref) v_list->val[i];
- define_var(v, $4.typ);
- }
- >>
- { ASSIGNMENT expr
- << mark_var_use(#2, VAR_RHS); >>
- }
- SEMI
- << free_pvec(v_list); >>
- )+
- ;
- var_id_list > [pvec_ref v_list]:
- << object_ref this_var;
- $v_list = new_pvec();
- >>
- ID
- << this_var = new_var_id(&$1);
- if (this_var != NULL) append_pvec($v_list, (void *)this_var);
- >>
- (
- COMMA ID
- << this_var = new_var_id(&$2);
- if (this_var != NULL) append_pvec($v_list, (void *)this_var);
- >>
- )*
- ;
-
- The "pvec" stuff above is just a vector of pointers that can be
- extended automatically. A linked list would work just as well. The
- idea is that we must first collect the declared variables, then
- parse the type declaration, then apply bind the type to the declared
- variables. We used ANTLR's auto-tree-generation mode, and didn't
- override its actions with our own. Therefore, the following Sorcerer
- fragment will recognize the AST built for a variable declaration:
- Page 51
-
- var_declare:
- #( VAR
- ( v_list: var_id_list COLON
- { EXTERN | STATIC }
- type
- { ASSIGNMENT expr }
- SEMI
- )+
- )
- ;
- var_id_list:
- ID ( COMMA ID)*
- ;
-
- Here's an example, where we use explicit rules to build an AST:
-
- expr!:
- simple_expr
- << $expr = $1; #0 = #1; >>
- ( rel_op simple_expr
- << parse_binary_op(&$expr, &$1, &$2); #0 = #(NULL, #0, #1, #2); >>
- )*
- << $expr.token = EXPR;
- $expr.text = "expr";
- #0 = #(#[&$expr], #0);
- >>
- ;
-
- The construct, #[&$expr] first takes the address of the $expr
- attribute (attributes are structures, not pointers, in this example),
- and then applies the #[] operation which makes a call to the routine
- that creates an AST node, given an attribute (or attribute address
- in our case). It takes a while to get the hang of where the &'s,
- #'s, and $'s go, but can be a real time-saver once you master it.
- What we're doing above is building a special EXPR (expression) node.
- This node would be parsed as follows in subsequent passes, using
- Sorcerer:
-
- expr: #( e: EXPR
- l_oprnd: simple_expr << e->typ = l_oprnd->typ; >>
- (op: rel_op r_oprnd: simple_expr
- <<
- e->typ = std_bool_type->obj_type;
- if (op->token == IN) {
- /* no type conversion checking for IN.
- * try to rewrite simple IN ops.
- */
- if (is_simple_in_op(l_oprnd, op, r_oprnd)) {
- rewrite_simple_in_op(l_oprnd, op, r_oprnd);
- }
- } else {
- cvt_term(&l_oprnd, op, r_oprnd, _t);
- }
- >>
- )*
- );
- Page 52
-
- We left in the actual actions of the second (Sorcerer driven) pass.
- Notice how the Sorcerer grammar labels various parts of the expr
- node ('e', 'l_oprnd', 'op', and 'r_oprnd'). This gives the second
- pass access to each node, as it is recognized.
-
- The second pass uses the 'typ' field, which contains the type of
- the ID, expression, or literal parsed by the first pass. In the
- actions above, we are propagating additional type information (for
- example, the result of a relational op is always a boolean, checking
- for implicit type conversions, and handling simple cases of Pascal's
- IN operation). The fragment above is from a Pascal to Ada translator,
- so the translator has to make Pascal's implicit type conversions
- between integer and real into explicit Ada type conversions, and
- has to convert operations on sets (i.e. IN) into operations on
- packed boolean arrays, in Ada, or calls to runtime routines.
-
- Sometimes when you are building the AST for a given construct,
- you need to use information gained from semantic analysis. An
- example is the "assignment" or "call" statement:
-
- /*
- * If a variable access appears alone, then it must be either a call to
- * a procedure with no parameters, or an indirection through a pointer
- * to a procedure with no parameters.
- */
- assign_or_call_stmt!:
- << type_ref r_type = NULL;
- ast_ref v;
- >>
- variable
- << v = #1;
- $assign_or_call_stmt = $1;
- $assign_or_call_stmt.token = PROC_CALL;
- $assign_or_call_stmt.text = "proc_call";
- r_type = $assign_or_call_stmt.typ;
- if (v != NULL && v->obj != NULL
- && v->obj->obj_result != NULL
- && v->obj->obj_kind == func_obj
- && v->down->token == ID
- && v->down->right == NULL) {
- object_ref func = v->obj;
- /* function name used on left hand side;
- * convert to reference to the function's return value
- */
- v->obj = func->obj_result;
- v->typ = func->obj_result->obj_type;
- v->down->text = func->obj_result->obj_name;
- v->down->obj = func->obj_result;
- v->down->typ = func->obj_result->obj_type;
- }
- #0 = v;
- >>
- Page 53
-
- { ASSIGNMENT expr
- << $assign_or_call_stmt.token = ASSIGNMENT;
- $assign_or_call_stmt.text = ":=";
- mark_var_use(#2, VAR_RHS);
- mark_var_use(v, VAR_LHS);
- #0 = #(NULL, #0, #[&$1], #2);
- >>
- | ( LPAREN actual_param_list RPAREN
- << mark_actual_param_use(#2, r_type);
- mark_var_use(v, VAR_RHS);
- #0 = #(NULL, #0, #[&$1], #2, #[&$3]);
- >>
- )
- }
- << #0 = #( #[&$assign_or_call_stmt], #0); >>
- ;
-
-
- The problem we're solving is that both an assignment statement and
- a procedure call statement begin with a "variable". Since ANTLR
- is LL-based, this statement construct is "ambiguous" in that
- both statement types (assignment and call) begin with the same
- non-terminal. A "variable" includes such operations as array
- subscripting, pointer dereferencing, and record field selection.
- Thus, a "variable" may comprise an arbitrary number of tokens.
-
- We might use syntactic predicates as a form of look-ahead to resolve
- the two cases above, but instead I decided to make the assumption that
- we have "PROC_CALL", and to correct that "guess" once we see the
- assignment operation. Thus, the above rule will build one of the
- following two AST structures:
-
- assign_stmt:
- #( ASSIGNMENT
- variable ASSIGNMENT expr
- )
- ;
- call_stmt:
- #( PROC_CALL
- variable {LPAREN actual_param_list RPAREN}
- )
- ;
-
- In your AST, you might want to drop unnecessary syntactic tokens such as
- ASSIGNMENT, LPAREN, RPAREN, COMMA, COLON, etc. We kept them, because we
- thought it would be necessary for certain parts of source-to-source
- translation. We don't think that's true, any more, but have not gone back
- and changed the AST structure either.
- Page 54
-
- 3) Build a separate Sorcerer grammar file to recognize the AST that you
- have built, and then add your second pass actions. These actions will
- access fields in the AST node, that were filled in by the first pass. For
- example, identifiers will probably have an "object_ref" that points to the
- object named by the identifier, and expression (EXPR) nodes will have a
- "typ" field that gives the expression's type. You might also add a "value"
- field that gives the value of a literal, named literal, or statically
- evaluated constant expression. See the code fragments above for some ideas
- on how this is done.
-
- Conclusions:
-
- 1) You'll need an ANTLR (.g) description for pass1, and a separate
- Sorcerer (.sor) description for pass2. Often the pass2 AST
- representation is much more regular and well-formed than the
- original text token stream used by pass1.
-
- 2) It can be a bit intimidating putting the pieces together.
- Try it incrementally, trying a small subset of your larger
- problem.
-
- 3) There are a lot of ways to go with how you represent
- attributes ($-variables), AST nodes, and the things that
- go on in various passes. For example, you might have
- pass1 simply build the AST and perform *no* symbol definitions
- or semantic checks. Then pass2 might walk the tree and build
- the symbol, and make various checks. Pass2 might also disambiguate
- cases that look syntactically similar, and can only be disambiguated
- using symbol definitions. Then, you could have a pass3 (another
- Sorcerer driven tree-walk) that does the 'real work' of your
- compiler/translator.
-
- Contributed by Gary Funck (gary@intrepid.com)
- Page 55
-
- ===============================================================================
- Example 1 of #lexclass
- ===============================================================================
- Borrowed code
- -------------------------------------------------------------------------------
- /*
- * Various tokens
- */
- #token "[\t\ ]+" << zzskip(); >> /* Ignore whitespace */
- #token "\n" << zzline++; zzskip(); >> /* Count lines */
-
- #token "\"" << zzmode(STRINGS); zzmore(); >>
- #token "'" << zzmode(CHARACTERS); zzmore(); >>
- #token "/\*" << zzmode(COMMENT); zzskip(); >>
- #token "//" << zzmode(CPPCOMMENT); zzskip(); >>
-
- /*
- * C++ String literal handling
- */
- #lexclass STRINGS
- #token STRING "\"" << zzmode(START); >>
- #token "\\\"" << zzmore(); >>
- #token "\\n" << zzreplchar('\n'); zzmore(); >>
- #token "\\r" << zzreplchar('\r'); zzmore(); >>
- #token "\\t" << zzreplchar('\t'); zzmore(); >>
- #token "\\[1-9][0-9]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 10));
- zzmore(); >>
- #token "\\0[0-7]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 8));
- zzmore(); >>
- #token "\\0x[0-9a-fA-F]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 16));
- zzmore(); >>
- #token "\\~[\n\r]" << zzmore(); >>
- #token "[\n\r]" << zzline++; zzmore(); /* Print warning */ >>
- #token "~[\"\n\r\\]+" << zzmore(); >>
-
- /*
- * C++ Character literal handling
- */
- #lexclass CHARACTERS
- #token CHARACTER "'" << zzmode(START); >>
- #token "\\'" << zzmore(); >>
- #token "\\n" << zzreplchar('\n'); zzmore(); >>
- #token "\\r" << zzreplchar('\r'); zzmore(); >>
- #token "\\t" << zzreplchar('\t'); zzmore(); >>
- #token "\\[1-9][0-9]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 10));
- zzmore(); >>
- #token "\\0[0-7]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 8));
- zzmore(); >>
- #token "\\0x[0-9a-fA-F]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 16));
- zzmore(); >>
- #token "\\~[\n\r]" << zzmore(); >>
- #token "[\n\r]" << zzline++; zzmore(); /* Print warning */ >>
- #token "~[\'\n\r\\]" << zzmore(); >>
-
- /*
- * C-style comment handling
- */
- #lexclass COMMENT
- #token "\*/" << zzmode(START); zzskip(); >>
- #token "\n" << zzline++; zzskip(); >>
- #token "~[\*\n]+" << zzskip(); >>
- #token "\*" << zzskip(); >>
-
- /*
- * C++-style comment handling
- */
- #lexclass CPPCOMMENT
- #token "[\n\r]" << zzline++; zzmode(START); zzskip(); >>
- #token "~[\n\r]" << zzskip(); >>
-
- #lexclass START
-
- /*
- * Assorted literals
- */
- #token OCT_NUM "0[0-7]*"
- #token L_OCT_NUM "0[0-7]*[Ll]"
- #token INT_NUM "[1-9][0-9]*"
- #token L_INT_NUM "[1-9][0-9]*[Ll]"
- #token HEX_NUM "0[Xx][0-9A-Fa-f]+"
- #token L_HEX_NUM "0[Xx][0-9A-Fa-f]+[Ll]"
- #token FLOAT_NUM "([1-9][0-9]*{.[0-9]*} | {0}.[0-9]+) {[Ee]{[\+\-]}[0-9]+}"
-
- /*
- * Identifiers
- */
- #token Identifier "[_a-zA-Z][_a-zA-Z0-9]*"
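- The strtol() conversions in the escape-handling actions above can be
- checked in isolation. decode_escape() below is an illustrative stand-in
- for those actions, not part of pccts: 'text' plays the role of zzbegexpr
- and points at the leading backslash, so the conversion must start one
- character past it (strtol() itself accepts the "0x" prefix in base 16):

```c
#include <assert.h>
#include <stdlib.h>

/* Given the matched text of an escape such as "\65", "\0101" or "\0x41",
 * return the character it denotes.  'text' points at the backslash, as
 * zzbegexpr does in the lexer actions, so strtol() starts at text+1. */
static char decode_escape(const char *text, int base)
{
    return (char) strtol(text + 1, NULL, base);
}
```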
-
- ===============================================================================
- Example 2: ASTs
- ===============================================================================
- #header <<
-
- #include "charbuf.h"
- #include <string.h>
-
- int nextSerial;
-
- #define AST_FIELDS int token; int serial; char *text;
- #include "ast.h"
-
- #define zzcr_ast(ast,attr,tok,astText) \
- (ast)->token=tok; \
- (ast)->text=strdup( (char *) &( ( (attr)->text ) ) ); \
- nextSerial++; \
- (ast)->serial=nextSerial;
-
- #define zzd_ast(node) delete_ast(node)
-
- void delete_ast (AST *node);
-
- >>
-
- <<
-
- AST *root=NULL;
-
- void show(AST *tree) {
- if (tree->token==ID) {
- printf (" %s <#%d> ",
- tree->text,tree->serial);}
- else {
- printf (" %s <#%d> ",
- zztokens[tree->token],
- tree->serial);
- };
- }
- void before (AST *tree) {
- printf ("(");
- }
- void after (AST *tree) {
- printf (")");
- }
-
-
- void delete_ast(AST *node) {
- printf ("\nzzd_ast called for <node #%d>\n",node->serial);
- free (node->text);
- return;
- }
-
- int main() {
- nextSerial=0;
- ANTLR (expr(&root),stdin);
- printf ("\n");
- zzpre_ast(root,show,before,after);
- printf ("\n");
- zzfree_ast(root);
- return 0;
- }
- >>
-
- #token WhiteSpace "[\ \t]" <<zzskip();>>
- #token ID "[a-zA-Z]+"
- #token NEWLINE "\n"
- #token OpenAngle "<"
- #token CloseAngle ">"
-
- expr : (expr0 NEWLINE)
-
- ;expr0 : expr1 {"="^ expr0}
- ;expr1 : expr2 ("\+"^ expr2)*
- ;expr2 : expr3 ("\*"^ expr3)*
- ;expr3 : ID
- -------------------------------------------------------------------------------
- Sample output from this program:
-
- a=b=c=d
- ( = <#2> a <#1> ( = <#4> b <#3> ( = <#6> c <#5> d <#7> ))) NEWLINE <#8>
- zzd_ast called for <node #7>
- zzd_ast called for <node #5>
- zzd_ast called for <node #6>
- zzd_ast called for <node #3>
- zzd_ast called for <node #4>
- zzd_ast called for <node #1>
- zzd_ast called for <node #8>
- zzd_ast called for <node #2>
-
- a+b*c
- ( \+ <#2> a <#1> ( \* <#4> b <#3> c <#5> )) NEWLINE <#6>
- zzd_ast called for <node #5>
- zzd_ast called for <node #3>
- zzd_ast called for <node #4>
- zzd_ast called for <node #1>
- zzd_ast called for <node #6>
- zzd_ast called for <node #2>
-
- a*b+c
- ( \+ <#4> ( \* <#2> a <#1> b <#3> ) c <#5> ) NEWLINE <#6>
- zzd_ast called for <node #3>
- zzd_ast called for <node #1>
- zzd_ast called for <node #5>
- zzd_ast called for <node #2>
- zzd_ast called for <node #6>
- zzd_ast called for <node #4>
-
- ===============================================================================
- Example 3: Syntactic Predicates
- ===============================================================================
- Not completed.
- ===============================================================================
- Example 4: DLG input function
- ===============================================================================
- This example demonstrates the use of a DLG input function to work
- around a limitation of DLG. In this example the user wants to
- recognize an exclamation mark as the first character of a line and
- treat it differently from an exclamation mark elsewhere. The
- workaround is for the input function to return a non-printing
- character (binary 1) when it finds an "!" in column 1. If it reads a
- genuine binary 1 in column 1 of the input text it returns a "?".
-
- The parse is started by:
-
- int DLGchar (void);
- ...
- ANTLRf (expr(&root),DLGchar);
- ...
- -------------------------------------------------------------------------------
- #token BANG "!"
- #token BANG_COL1 "\01"
- #token WhiteSpace "[\ \t]" <<zzskip();>>
- #token ID "[a-zA-Z]+"
- #token NEWLINE "\n"
-
- expr! : (bang <<printf ("\nThe ! is NOT in column 1\n");>>
- | bang1 <<printf ("\nThe ! is in column 1\n");>>
- | id <<printf ("\nFirst token is an ID\n");>>
- )* "@"
-
- ;bang! : BANG ID NEWLINE
-
- ;bang1! : BANG_COL1 ID NEWLINE
-
- ;id! : ID NEWLINE
- ;
- -------------------------------------------------------------------------------
-
- #include <stdio.h>
-
- /*
- Antlr DLG input function - See page 18 of pccts 1.00 manual
- */
-
- static int firstTime=1;
-
- static int c;
-
- int DLGchar (void) {
- if (feof(stdin)) {
- return EOF;
- }
- if (firstTime || c=='\n') {
- firstTime=0;
- c=fgetc(stdin);
- if (c==EOF) return (EOF);
- if (c=='!') return ('\001');
- if (c=='\001') return ('?');
- return (c);
- } else {
- c=fgetc(stdin);
- return (c);
- }
- }
-
- ===============================================================================
- Example 5: Maintaining a Stack of DLG Modes
- ===============================================================================
- Contributed by David Seidel
- Normally this should be appended to err.h
- -------------------------------------------------------------------------------
- #define MAX_MODE ????
- #define ZZMAXSTK (MAX_MODE * 2)
-
- static int zzmstk[ZZMAXSTK] = { -1 };
- static int zzmdep = 0;
-
- void
- #ifdef __STDC__
- zzmpush( int m )
- #else
- zzmpush( m )
- int m;
- #endif
- {
- if(zzmdep == ZZMAXSTK - 1)
- { zzerr("Mode stack overflow");
- }
- else
- { zzmstk[zzmdep++] = zzauto;
- zzmode(m);
- }
- }
-
- void
- zzmpop()
- {
- if(zzmdep == 0)
- { zzerr("Mode stack underflow");
- }
- else
- { zzmdep--;
- zzmode(zzmstk[zzmdep]);
- }
- }
-
- -------------------------------------------------------------------------------
- A modified version of the above routine which allows the user to pass a
- "lookahead" routine to be called when the mode stack is popped. Normally
- this will be appended to err.h.
- -------------------------------------------------------------------------------
- #define ZZMAXSTK ????
-
- static int zzmstk[ZZMAXSTK] = { -1 }; /* stack of DLG modes */
- static void (*zzfuncstk[ZZMAXSTK])(); /* stack of pointer to functions */
- static int zzmdep = 0;
-
- void pushMode( int m ,void (*func)())
- {
- if(zzmdep == ZZMAXSTK - 1)
- { zzerr("Mode stack overflow");
- }
- else
- { zzmstk[zzmdep] = zzauto;
- zzfuncstk[zzmdep] = func;
- zzmdep++;
- zzmode(m);
- }
- }
-
- void popMode()
- {
- void (*thisFunc)();
- if(zzmdep == 0)
- { zzerr("Mode stack underflow");
- }
- else
- { zzmdep--;
- thisFunc=zzfuncstk[zzmdep];
- zzmode(zzmstk[zzmdep]);
- zzmstk[zzmdep]=0;
- zzfuncstk[zzmdep]=0;
- /* this call might result in indirect recursion of popMode() */
- if (thisFunc!=0) {
- (*thisFunc)();
- };
- }
- }
-
- void resetModeStack() {
- zzmdep=0;
- zzmstk[0]=0;
- zzfuncstk[0]=0;
- }
-
- /* if the lookahead character is a semi-colon then keep on popping */
-
- void popOnSC() {
- if (zzchar==';') popMode();
- }
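- The push/pop-with-callback pattern above can be exercised outside of
- DLG with a self-contained sketch. All names below are illustrative:
- in the real routines, zzauto, zzmode(), and zzchar come from DLG,
- whereas here a plain 'current' variable stands in for the active mode:

```c
#include <assert.h>
#include <stddef.h>

#define MAXSTK 8

/* 'current' stands in for DLG's zzauto; assigning it, for zzmode() */
static int   current = 0;
static int   modestk[MAXSTK];
static void (*funcstk[MAXSTK])(void);
static int   depth = 0;

static void push_mode(int m, void (*func)(void))
{
    if (depth < MAXSTK - 1) {
        modestk[depth] = current;   /* remember the mode we are leaving */
        funcstk[depth] = func;      /* remember its "lookahead" routine */
        depth++;
        current = m;
    }
}

static void pop_mode(void)
{
    if (depth > 0) {
        depth--;
        current = modestk[depth];
        if (funcstk[depth] != NULL)
            (*funcstk[depth])();    /* may itself call pop_mode() again */
    }
}
```

- Popping restores both the saved mode and, if one was supplied, invokes
- the callback, which is how popOnSC() can cascade further pops.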
-
- ===============================================================================
- Example 6: Debug code for create_ast, mk_ast, delete_ast to locate lost ASTs
- ===============================================================================
- This is an example of code which tries to keep track of lost ASTs using
- a doubly linked list of all ASTs, maintained by calls from create_ast()
- and mk_ast() to zzastnew_userhook(). When ASTs are deleted, calls to
- zzastdelete_userhook() from the user's AST delete routines remove them
- from the doubly linked list. Any ASTs left on the list after zzfree_ast()
- must be considered lost.
-
- This method does not monitor ASTs created by zzdup_ast(), because
- zzdup_ast() does not call the create_ast() or mk_ast() routines.
- -------------------------------------------------------------------------------
- The #header section must include a definition of AST_FIELDS with the
- equivalent of:
-
- struct _ast *flink, *blink; int serialNumber;
- -------------------------------------------------------------------------------
- int main() {
- ...
- again:
- ...
- reset_ASTlistHead(); /* <======================== */
- ANTLR (sourcecode(&root),stdin);
- treewalk(root);
- zzfree_ast(root);
- root=NULL;
- print_lost_ast(); /* <======================= */
- printf ("\n");
- ...
- goto again;
- ...
- }
- -------------------------------------------------------------------------------
- #ifndef H_ZZNEWAST_USERHOOK
- #define H_ZZNEWAST_USERHOOK
-
- void reset_ASTlistHead(void);
- void zzunhook_tree(AST *tree);
- void zzastnew_userhook(AST *newNode);
- void zzastdelete_userhook (AST *singleNode);
- void print_lost_ast (void);
- void treewalk(AST *tree);
-
- #endif
- -------------------------------------------------------------------------------
- #include "stdpccts.h"
- #include <stdlib.h>
- #include "zzastnew_userhook.h"
-
- static AST ASTlistHead;
- static int ASTserialNumber;
-
- void reset_ASTlistHead(void) {
- while (ASTlistHead.flink!=0 && ASTlistHead.flink!= &ASTlistHead) {
- zzfree_ast(ASTlistHead.flink);
- };
- ASTlistHead.flink=&ASTlistHead;
- ASTlistHead.blink=&ASTlistHead;
- ASTserialNumber=1;
- return;
- }
-
- /* Stop tracking ASTs in a tree without actually deleting them */
-
- void zzunhook_tree (AST * tree) {
-
- while (tree != 0) {
- zzunhook_tree (tree->down);
- zzastdelete_userhook (tree);
- tree=tree->right;
- };
- return;
- }
-
- /* Track new AST */
-
- void zzastnew_userhook(AST *newNode) {
-
- AST *prev;
-
- prev=ASTlistHead.blink;
- prev->flink=newNode;
- ASTlistHead.blink=newNode;
- newNode->blink=prev;
- newNode->flink=&ASTlistHead;
- newNode->serialNumber=ASTserialNumber;
- ASTserialNumber++;
- return;
- }
-
- /* Stop tracking an AST */
-
- void zzastdelete_userhook (AST *singleNode) {
-
- AST *fnode;
- AST *bnode;
-
- if (singleNode!=0) {
- fnode=singleNode->flink;
- bnode=singleNode->blink;
- fnode->blink=bnode;
- bnode->flink=fnode;
- singleNode->serialNumber=0;
- singleNode->flink=0;
- singleNode->blink=0;
- };
- return;
- }
-
- /* Print ASTs that are still on list */
-
- void print_lost_ast () {
-
- AST *node;
-
- for (node=ASTlistHead.flink;
- node!=0 && node!= &ASTlistHead;
- node=node->flink) {
- printf ("**** Start of lost AST listing **** %d\n",node->serialNumber);
- treewalk (node); /* user supplied routine */
- printf ("\n**** End of lost AST listing ****\n");
- };
- }
-
- -------------------------------------------------------------------------------
- These routines print out the AST tree. This will be application dependent.
- -------------------------------------------------------------------------------
- #include "stdpccts.h"
- #include <stdlib.h>
-
- static int treenest=0;
-
- void treeindent(int nesting) {
- int i;
- for (i=0;i<nesting*2;i++) {
- printf (" ");
- };
- return;
- }
-
- void treewalk1 (AST *tree) {
- while (tree != NULL) {
- treeindent(treenest);
- printf ("%s",zztokens[tree->token]);
- if (tree->text != NULL) {
- printf (" %s",tree->text);
- };
- printf ("\n");
- treenest++;
- treewalk1 (tree->down);
- treenest--;
- tree=tree->right;
- };
- return;
- }
-
- void treewalk (AST *tree) {
- treenest=0;
- treewalk1(tree);
- return;
- }
- -------------------------------------------------------------------------------
-